Syllabus

GEO5165C: Quantitative Geography

Contact information

  • Instructor Name: Professor James B. Elsner (he/him/his)
  • Instructor Location: Bellamy Building, Room 323a
  • Lesson Hours: TR 3:05-4:20 p.m.
  • Student Hours: TR 9-10 a.m., 2-3 p.m.

Email:

Course description

This course is an introduction to the quantitative analysis of geographic data (data analysis for geographers). Most of the course content will be available through Canvas and through RStudio Cloud. Please open an account with RStudio Cloud at (https://rstudio.cloud).

Please use this link https://rstudio.cloud/spaces/12733/join?access_code=NuhGFcK71GlGuzoKzAUIe1lqgcMDyOIC7UnnFtNG to join the Spatial Data Analysis workspace on RStudio Cloud.

Expected learning outcomes

You will describe and demonstrate the principles of data science. You will do this with a grammar for manipulating data and a grammar for making graphs. The grammars are implemented in R using the syntax of tidyverse.

Materials

Class meetings

  • Online: synchronous, interactive, asynchronous recordings available on Canvas
  • Some lectures, lots of learn-by-doing

Grades

  • Grades are determined solely by how well you do on the regularly scheduled homework/classwork assignments.
  • There are NO quizzes, tests, or exams.
  • Synchronous attendance is expected but not required.
  • Late classwork or homework assignments will not be accepted.
  • Cumulative numerical averages of 90 - 100 (outstanding) are guaranteed at least an A-, 80 - 89 (good) at least a B-, and 70 - 79 (satisfactory) at least a C-. The exact ranges for letter grades will be determined after all work is complete.

Students With Disabilities Act

Students needing academic accommodation should: (1) register with and provide documentation to the Student Disability Resource Center (https://dos.fsu.edu/sdrc/); (2) bring a letter to me indicating the need for accommodation and what type. This should be done sometime during the first week of classes.

Inclusiveness

It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Let me know ways to improve the effectiveness of the course for you personally, or for other students or student groups.

  • If you have a name and/or set of pronouns that differ from those that appear in your official FSU records, please let me know.
  • If you feel your performance is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. If you prefer to speak with someone outside of the course, your academic dean is an excellent resource.
  • If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it.

More about your instructor

Syllabus change policy

This syllabus is a guide for the course and is subject to change with advance notice.

Schedule (subject to change with notice)

Week  Dates                         Topic
1     August 24, 26, 28             RStudio Cloud and R
2     August 31, September 2, 4     Working with R
3     September 9, 11               Data and data frames
4     September 14, 16, 18          Data analysis
5     September 21, 23, 25          Graphical analysis
6     September 28, 30, October 1   Mapping
7     October 5, 7, 9               Bayesian data analysis
8     October 12, 14, 16            Regression
9     October 19, 21, 23            Multiple regression
10    October 26, 28, 30            Regression trees
11    November 2, 4, 6              Spatial data
12    November 9, 13                Spatial autocorrelation
13    November 16, 18, 19           Spatial autocorrelation
14    November 30, December 2, 3    Geographic regression

I will cover new material on Mondays and Wednesdays. On Fridays you will work on your assignment. Assignments are due Fridays at 5 p.m.

Assignment Due Date (no later than 5 pm)
1 August 28
2 September 4
3 September 11
4 September 25
5 October 1
6 October 9
7 October 16
8 October 23
9 October 30
10 November 19
11 December 3

Other materials to check out

Julia programming language

Julia programming language https://julialang.org/ Download > Open

Jupyter Notebook

  1. Anaconda > Individual > Download > Install
  2. In the Julia REPL type: using Pkg; Pkg.add("IJulia")
  3. Then click on the Anaconda-Navigator icon and launch Jupyter Notebook
  4. Click on the New button and select Julia.

Problems? Watch https://www.youtube.com/watch?v=oyx8M1yoboY

Pluto Notebook

In the Julia REPL type:

import Pkg; Pkg.add("Pluto")
import Pluto
Pluto.run()

md"""
# This Pluto notebook is a test.
"""

begin
    a = [1, 4, 7, 22]
    a * 10
end

Tuesday, August 23, 2022

  • Is it getting hotter here in Tallahassee?
  • Are Atlantic hurricanes getting stronger?

Data science (formerly known as ‘statistics’) is an exciting discipline that allows you to turn data into understanding, insight, and knowledge.

Today

  • Understand what this course is about, how it is structured, and my expectations for you
  • Start working with RStudio and R.

What is this course?

This is designed as a first course in data science for geographers.

Q - What statistics background does this course assume?
A - None.

Q - Is this an intro stat course?
A - Statistics and data science are closely related with much overlap. Hence, this course is a great way to get started with statistics. But this course is not your typical high school/college statistics course.

Q - Will you be doing computing?
A - Yes.

Q - Is this an introduction to computer science course?
A - No, but many themes are shared.

Q - What computing language will you learn?
A - R.

Q - Why not some other language?
A - We can discuss that over coffee.

Where are the materials for this course?

Github

RStudio Cloud

Join RStudio Cloud

Examples

Some of my recent research:

Other research:

Course Syllabus

  • Log on to RStudio Cloud and click on this course’s Space (Quantitative Geography Using R).
  • Click on the project 00_Syllabus and launch it.
  • Open the 00-Syllabus.Rmd file (lower-right panel), and then click on the “Knit” button.
  • Review

Tallahassee daily temperatures

  • Packages > Install
  • In the Packages window, type tidyverse, lubridate, here, ggplot2 then select Install

Get the data into your environment.

TLH.df <- readr::read_csv(file = here::here('data', 'TLH_SOD1892.csv'),
                          show_col_types = FALSE) |>
      dplyr::filter(STATION == 'USW00093805') |>
      dplyr::mutate(Date = as.Date(DATE)) |>
      dplyr::mutate(Year = lubridate::year(Date), 
             Month = lubridate::month(Date), 
             Day = lubridate::day(Date),
             doy = lubridate::yday(Date)) |>
      dplyr::select(Date, Year, Month, Day, doy, TMAX, TMIN, PRCP)

The code above uses the package::function() syntax, which names the package that each function comes from (:: is called a library specifier).

Or, load the packages into your current environment with the library() function in the file above where they are first used.
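
As a small illustration of the two styles, the sketch below calls median() from the {stats} package both ways. The vector x is made up for the example, and {stats} is already attached in every R session, so the library() call is there only to mirror the pattern you would use with other packages.

```r
# A made-up vector for illustration.
x <- c(2, 4, 4, 6)

# Style 1: name the package explicitly with ::
m1 <- stats::median(x)

# Style 2: attach the package, then call the function by its bare name.
library(stats)
m2 <- median(x)

# Both styles call the same function, so the results agree.
identical(m1, m2)
```

The :: style makes it obvious where a function comes from; the library() style saves typing when you use many functions from the same package.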

Create a plot of the frequency of high temperatures.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

TLH.df |>
  group_by(TMAX) |>
  summarize(nH = n()) |>
ggplot(mapping = aes(x = TMAX, y = nH)) +
  geom_col(col = 'white', fill = "gray70") +
  labs(title = "Frequency of Daily High Temperatures",
       subtitle = "Tallahassee, FL, USA (1940-2018)",
       x = "Daily High Temperature (°F)",
       y = "Number of Days") +
 scale_x_continuous(breaks = seq(from = 20, to = 110, by = 10)) +
 theme_minimal()
## Warning: Removed 1 rows containing missing values (position_stack).

Thursday, August 25, 2022

Today

  • Data science: reproducibility, communication, and automation

  • Structure of markdown files

  • How to make a simple plot

  • Everything you create is an object

  • Turn off your camera.

  • Any questions about my grading of your assignments?

  • Make sure (1) you are watching (or at least listening to) me via Zoom, and (2) you have a copy of the 02_Lesson project and have the 02-Lesson.Rmd file open.

  • Follow along in your copy of the lesson as I go line by line through the file on Zoom.

  • Your file’s background and text might look different from mine. If so, go to Tools > Global Options > Appearance > Cobalt

  • Much of the lesson materials come from online books: https://www.bigbookofr.com/index.html

  • Datasets: https://kieranhealy.org/blog/archives/2020/08/25/some-data-packages/

Data Analysis

Data analytics are done on a computer. You have two choices: use a spreadsheet or write code.

A spreadsheet is convenient, but it makes the three conditions for a good data analysis (reproducibility, communication, and automation) difficult to achieve.

Reproducibility

A scientific paper is an advertisement for a claim. But the proof is the procedure that was used to obtain the result.

If your analysis is to be convincing, the trail from the data you started with to the final output must be available to the public. A reproducible trail with a spreadsheet is hard. It is easy to make mistakes (e.g., accidentally sorting just a column rather than the entire table).

A set of instructions written as computer code is the exact procedure. (Open stronger-hur.Rmd).

Communication

Code is a recipe for what you did. It communicates precisely what was done. Communication to others and to your future self.

It’s hard to explain to someone precisely what you did when working with a spreadsheet. Click here, then right click here, then choose menu X, etc. The words needed to describe these procedures are not standard. Code is an efficient way to communicate because all important information is given as plain text with no ambiguity.

Automation

If you’ve ever made a map using a geographic information system (GIS) you know how hard it is to make another one with a new set of data (even a very similar one). Running code with new data is simple.

Being able to code is an important skill for nearly all technical jobs. Here you will learn how to code. But keep in mind: Just like learning to write doesn’t mean you will be a writer (i.e., make a living writing), learning to code doesn’t mean you will be a coder.

The R programming language

  • R is a leading open-source programming language for data science, alongside Python.
  • Free, open-source, runs on Windows, Macs, etc. Excellent graphing capabilities. Powerful, extensible, and relatively easy to learn syntax. Thousands of functions.
  • Has all the cutting edge statistical methods including methods in spatial statistics.
  • Used by scientists of all stripes. Most of the world’s statisticians use it (and contribute to it).

Overview of this course

We start with making graphs. You will make clear, informative plots that will help you understand your data. You will learn the basic structure of making a plot.

Visualization alone is not enough, so you will also learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries (data wrangling).

You will then combine data wrangling and visualization with your curiosity to ask and answer interesting questions by learning how to fit models to your data. Data models extend your ability to ask and answer questions about the world you live in.

With geographic and environmental data collected at different locations these models will include a spatial component.

Work in plain text, using R Markdown

The ability to reproduce your work is important to a scientific process. It is also pragmatic. The person most likely to reproduce your work a few months later is you.

This is especially true for graphs and figures. These often have a finished quality to them as a result of tweaking and adjustments to the details. This makes it hard to reproduce them later.

The goal is to do as much of this tweaking as possible with the code you write, rather than in a way that is invisible (retrospectively). Contrast this with editing an image in Adobe Illustrator.

You will find yourself constantly going back and forth between three things:

  1. Writing code: You will write code to produce plots. You will also write code to load your data (get your data into R) and to look quickly at tables of that data. Sometimes you will want to summarize, rearrange, subset, or augment your data, or fit a statistical model to it. You will want to be able to write that code as easily and effectively as possible.

  2. Looking at output. Your code is a set of instructions that produces the output you want: a table, a model, or a figure. It is helpful to be able to see that output.

  3. Taking notes. You will also write about what you are doing, and what your results mean.

To do these things efficiently you want to write your code together with comments. This is where markdown comes in (files that end with .Rmd)

An R markdown file is a plain text document where text (such as notes or discussion) is interspersed with pieces, or chunks, of R code. When you Knit this file the R code is executed piece by piece, in sequence starting at the top of the file, and the output either supplements or replaces the chunks of code.

The resulting file is then converted into a more easily-readable document formatted in HTML, PDF, or Word. The non-code segments of the document are plain text with simple formatting instructions (e.g., ## for section header).

There is a set of conventions for marking up plain text in a way that indicates how it should be formatted. Markdown treats text surrounded by asterisks, double asterisks, and backticks in special ways. It is R Markdown’s way of saying that these words are in

  • *italics*
  • _also italics_
  • **bold**, and
  • `code font`

Your class notes include code. There is a set format for including code into your markdown file (lines of code; code chunk). They look like this:

library(ggplot2)

I call these markings code chunk delimiters.

The opening delimiter is three back ticks (on a U.S. keyboard, the character under the escape key) followed by a pair of curly braces containing the name of the language you are using. The format is language-agnostic and can be used with, e.g., Python and other languages.

The back ticks-and-braces signals that what follows is code. You write your code as needed, and then end the chunk with a new line containing three more back ticks.
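
The delimiters themselves disappear when a file is knit, so it can help to see them as raw text. The sketch below is only an illustration: it builds the three lines of a chunk as character strings and prints them the way they would appear in the .Rmd file. The object names ticks and chunk_lines are made up for this example.

```r
# Three back ticks, built with strrep() so they are easy to see.
ticks <- strrep("`", 3)

# The three lines of a chunk as they appear in the raw .Rmd file.
chunk_lines <- c(paste0(ticks, "{r}"),   # opening delimiter with the language name
                 "library(ggplot2)",     # the code itself
                 ticks)                  # closing delimiter

# Print the lines one per row.
cat(chunk_lines, sep = "\n")
```

Swapping the r inside the braces for another language name (for example python) is what makes the format language-agnostic.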

If you keep your notes in this way, you will be able to see the code you wrote, the output it produces, and your own commentary or clarification on what the code did, all in a convenient way. Moreover, you can turn it into a good-looking document straight away with the Knit button.

This is how you will do everything in this course. In the end you will have a set of notes that you can turn into a book with bookdown.

Everything markdown

Visualizing data

To help motivate your interest in this course, we start by making a graph. There are three things to learn:

  1. How to create graphs with a reusable {ggplot2} template
  2. How to add variables to a graph with aesthetics
  3. How to select the ‘type’ of your graph with geoms

The following examples are taken from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. https://r4ds.had.co.nz/.

A code template

Let’s begin with a question to explore.

What do you think: Do cars with big engines use more fuel than cars with small engines?

  • A: Cars with bigger engines use more fuel.
  • B: Cars with bigger engines use less fuel.

You check your answer with two things: the mpg data that comes in {ggplot2} and a plot. The mpg object contains observations collected on 38 models of cars by the US Environmental Protection Agency. Among the variables in mpg are:

  • displ, a car’s engine size, in liters.
  • hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg).

A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To see a portion of the mpg data, type mpg after you have loaded the package using the library() function.

library(ggplot2)
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

You see the first 10 rows of the data. Note that there are 234 rows and 11 columns, so you are viewing only a portion of this spreadsheet.

Each row is a different car. The first row is the Audi A4 1999 model with automatic transmission (5 gears). The tenth car listed is the Audi A4 Quattro with manual transmission (6 gears).

The column labeled displ is the engine size in liters. A bigger number means the car has a larger engine. The column labeled hwy is the highway fuel efficiency in miles per gallon. A bigger number means the car uses less fuel to go the same distance (higher efficiency).

It is hard to check which answer is correct by looking only at these 10 cars. Note that bigger engines appear to have smaller values of highway mileage but it is far from clear.

You want to look at all 234 cars.

The code below uses functions from the {ggplot2} package to plot the relationship between displ and hwy for all cars.

Let’s look at the plot and then talk about the code itself. To see the plot, click on the little green triangle in the upper right corner of the gray shaded region.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

The plot shows an inverse relationship between engine size (displ) and fuel efficiency (hwy). Each point is a different car. Cars that have a large value of displ tend to have a small value of hwy and cars with a small value of displ tend to have a large value of hwy.

In other words, cars with big engines use more fuel. If that was your hypothesis, you were right!

Now let’s look at how you made the plot.

The code

Here’s the code used to make the plot. Notice that it contains three functions: ggplot(), geom_point(), and aes().

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

The first function, ggplot(), creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph.

By itself, ggplot(data = mpg) creates an empty graph, but it is not very interesting so I’m not going to show it here.

The function geom_point() adds a layer of points to the empty plot created by ggplot(). As a result, you get a scatterplot.

The function geom_point() takes a mapping argument, which defines which variables in your dataset are mapped to which axes in your graph. The mapping argument is always paired with the function aes(), which you use to bring together the mappings you want to create.

Here, you want to map the displ variable to the x axis (horizontal axis) and the hwy variable to the y axis (vertical axis), so you add x = displ and y = hwy inside of aes() (and you separate them with a comma). Where will ggplot() look for these mapped variables? In the data frame that you passed to the data argument, in this case, mpg.

  • Knit to generate HTML.
  • Compare the HTML with the Rmd.

A graphing workflow

The code above follows the common workflow for making graphs. To make a graph, you:

  1. Start the graph with ggplot()
  2. Add elements to the graph with a geom_ function
  3. Select variables with the mapping = aes() argument
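
The three steps can also be taken one at a time by saving the intermediate result. This is a sketch of the same mpg scatterplot, assuming {ggplot2} is installed; the object names p and p_points are made up for the example.

```r
library(ggplot2)

# Step 1: start the graph. On its own this draws an empty panel.
p <- ggplot(data = mpg)

# Steps 2 and 3: add a layer with a geom_ function and
# select the variables with mapping = aes().
p_points <- p + geom_point(mapping = aes(x = displ, y = hwy))

# Typing the name of the saved plot draws it.
p_points
```

Saving the plot as an object is optional; it simply makes each step of the workflow visible.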

A graphing template

In fact, you can turn your code into a reusable template for making graphs. To make a graph, replace the bracketed sections in the code below with a data set, a geom_ function, or a collection of mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Give it a try!

  1. Copy and paste the above code chunk, including the code chunk delimiters, and replace the y = hwy with y = cty.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = cty))

  2. Replace the bracketed sections < > with mpg, geom_boxplot, and x = class, y = hwy to make a slightly different graph.
ggplot(data = mpg) + 
 geom_boxplot(mapping = aes(x = class, y = hwy))

Common problems

As you start to work with R code, you are likely to run into problems. Don’t worry — it happens to everyone. I’ve been writing R code for decades, and I still write code that doesn’t work!

Start by comparing the code that you are running to the code in the examples in these notes. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Also pay attention to capitalization; R is case sensitive.

location of the + sign

One common problem when creating {ggplot2} graphics is to put the + in the wrong place: it must come at the end of a line, not the start. In other words, make sure you haven’t accidentally written code like this:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))
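
The reason is that R treats a line as finished as soon as it forms a complete expression. A base R sketch with made-up numbers shows the same behavior without any plotting:

```r
# The first line is a complete expression, so R evaluates it right away.
# The next line is then read on its own as "positive 2".
a <- 1
+ 2

# Ending the line with + tells R that the expression continues.
b <- 1 +
  2

a   # still 1
b   # 3
```

With a plot, the dangling second line has nothing to add a layer to, which is why the + belongs at the end of the line.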

help

If you’re still stuck, try the help. You can get help about any R function by running ?function_name in a code chunk, e.g. ?geom_point. Don’t worry if the help doesn’t seem that helpful — instead skip down to the bottom of the help page and look for a code example that matches what you’re trying to do.

If that doesn’t help, carefully read the error message that appears when you run your (non-working) code. Sometimes the answer will be buried there! But when you’re new to R, you might not yet know how to understand the error message. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.

Things to know

You are getting oriented to the language itself (what happens at the console), while learning to take notes in what might seem like an odd format (chunks of code interspersed with plain-text comments), in an IDE (integrated development environment) that has many features designed to make your life easier in the long run, but which can be hard to decipher at the beginning. Here are some general points to keep in mind about how R is designed. They might help you get a feel for how the language works.

Everything has a name

In R, everything you deal with has a name. You refer to things by their names as you examine, use, or modify them. Named entities include variables (like x, or y), data that you have loaded (like my_data), and functions that you use. (More about functions soon.) You will spend a lot of time talking about, creating, referring to, and modifying things with names.

Things are listed under the Environment tab in the upper right panel.

Some names are forbidden. These include reserved words like FALSE and TRUE, core programming words like Inf, for, else, break, function, and words for special entities like NA and NaN. (These last two are codes designating missing data and “Not a Number,” respectively.) You probably won’t use these names by accident, but it’s good to know that they are not allowed.

Some names you should not use, even if they are technically permitted. These are mostly words that are already in use for objects or functions that form part of the core of R. These include the names of basic functions like q() or c(), common statistical functions like mean(), range() or var(), and built-in mathematical constants like pi.

Names in R are case sensitive. The object my_data is not the same as the object My_Data. When choosing names for things, be concise, consistent, and informative. Follow the style of the tidyverse and name things in lower case, separating words with the underscore character, _, as needed. Do not use spaces when naming things, including variables in your data.
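
A short sketch of case sensitivity and naming style follows. All of the object names and values here are made up for illustration.

```r
# Two different objects: the names differ only in capitalization.
my_data <- c(10, 20, 30)
My_Data <- c("a", "b")

# R keeps them separate, so they are not the same thing.
identical(my_data, My_Data)

# A lower-case, underscore-separated name in the tidyverse style.
daily_high_temp <- c(88, 91, 93)
mean(daily_high_temp)
```

Having both my_data and My_Data in one project is legal but confusing, which is one more reason to stick to a single consistent style.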

Everything is an object

Some objects are part of R, some are added via packages, and some are created by you. But almost everything is some kind of object. The code you write will create, manipulate, and use named objects.

Let’s create a vector of numbers. The command c() is a function. It’s short for “combine” or “concatenate.” It will take a sequence of comma-separated things inside the parentheses and join them into a vector where each element is still accessible.

c(1, 2, 3, 1, 3, 5, 25)
## [1]  1  2  3  1  3  5 25

Instead of sending the result to the console, here you assign the result to an object.

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)

To see what you created, type the name of the object and hit return.

my_numbers
## [1]  1  2  3  1  3  5 25

Each of our numbers is still there, and can be accessed directly if you want. They are now just part of a new object, a vector, called my_numbers.

You create objects by assigning them to names. The assignment operator is <-. Think of assignment as the verb “gets,” reading left to right. So, the bit of code above is read as “The object my_numbers gets the result of concatenating the following numbers: 1, 2, …”

The operator is two separate keys on your keyboard: the < key and the - (minus) key. When you create objects by assigning things to names, they come into existence in R’s workspace or environment.
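
To see that each number is still accessible, here is a brief sketch using square brackets, which pick out elements by position. The bracket syntax has not been introduced yet in these notes, so treat this as a preview rather than something you need to know now.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

my_numbers[1]         # the first element
my_numbers[7]         # the last element
my_numbers[c(2, 3)]   # the second and third elements
length(my_numbers)    # how many elements the vector holds

# The object now exists in the environment under its name.
"my_numbers" %in% ls()
```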

You do things using functions

You do almost everything in R using functions. Think of a function as a special kind of object that can perform actions for you. It produces output based on the input that it receives. Like a good dog, when you want a function to do something, you call it. Somewhat less like a dog, it will reliably do what you tell it.

You give the function some information, it acts on that information, and some results come out the other side. Functions can be recognized by the parentheses at the end of their names. This distinguishes them from other objects, such as single numbers, named vectors, tables of data, and so on.

You send information to the function between the parentheses. Most functions accept at least one argument. A function’s arguments are the things it needs to know in order to do something. They can be some bit of your data (data = my_numbers), or specific instructions (title = "GDP per Capita"), or an option you want to choose (smoothing = "splines", show = FALSE).

For example, the object my_numbers is a numeric vector:

my_numbers
## [1]  1  2  3  1  3  5 25

But the thing you used to create it, c(), is a function. It combines the items into a vector composed of the series of comma-separated elements you give it. Similarly, mean() is a function that calculates a simple average for a vector of numbers. What happens if you just type mean() without any arguments inside the parentheses?

mean()

The error message is terse but informative. The function needs an argument to work, and you haven’t given it one. In this case, the missing argument is x, the name of the object that mean() performs its calculation on:

mean(x = my_numbers)
## [1] 5.714286

Or

mean(x = your_numbers)
## [1] 19.71429

While the function arguments have names that are used internally, (here, x =), you don’t strictly need to specify the name for the function to work:

mean(my_numbers)
## [1] 5.714286

If you omit the name of the argument, R will just assume you are giving the function what it needs, in the order the function expects. The documentation for a function will tell you what the order of required arguments is for any particular function.

For simple functions that only require one or two arguments, omitting their names is usually not confusing. For more complex functions, you will typically want to use the names of the arguments rather than try to remember what the ordering is.
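
As an illustration of positional versus named arguments, the sketch below uses the base R function seq(), whose first three arguments are from, to, and by. The numbers are arbitrary.

```r
# Positional: the values are matched to from, to, and by in that order.
s1 <- seq(1, 10, 3)

# Named: each value is labeled, so the order no longer matters.
s2 <- seq(by = 3, to = 10, from = 1)

s1
identical(s1, s2)
```

The named version is longer to type but much easier to read back later, which is the trade-off described above.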

In general, when providing arguments to a function the syntax is <argument> = <value>. If <value> is a named object that already exists in your workspace, like a vector of numbers or a table of data, then you provide it unquoted, as in mean(my_numbers). If <value> is not an object, a number, or a logical value like TRUE, then you usually put it in quotes, e.g., labs(x = "X Axis Label").

Functions take inputs via their arguments, do something, and return outputs. What the output is depends on what the function does. The c() function takes a sequence of comma-separated elements and returns a vector consisting of those same elements. The mean() function takes a vector of numbers and returns a single number, their average.

Functions can return far more than single numbers. The output returned by functions can be a table of data, or a complex object such as the results of a linear model, or the instructions needed to draw a plot. They can even be other functions. For example, the summary() function performs a series of calculations on a vector and produces what is in effect a little table with named elements.

A function’s argument names are internal to that function. Say you have created an object in your environment named x, for example. A function like mean() also has a named argument, x, but R will not get confused by this. It will not use your x object by mistake.
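
A quick sketch to convince yourself of this: create an object called x with made-up values, then call mean() on a different vector.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# An unrelated object that happens to be called x.
x <- c(100, 200, 300)

# Inside mean(), the argument x refers to my_numbers, not to your object x.
mean(x = my_numbers)

# Your own x is untouched.
x
```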

As you have already seen with c() and mean(), you can assign the result of a function to an object:

my_summary <- summary(my_numbers)

When you do this, there’s no output to the console. R just puts the results into the new object, as you instructed. To look inside the object you can type its name and hit return:

my_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.500   3.000   5.714   4.000  25.000
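
Because the elements of my_summary are named, you can pull out individual values by name. This is a small sketch; the bracket syntax is a preview of material that comes later.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
my_summary <- summary(my_numbers)

names(my_summary)       # the labels of the six elements
my_summary["Median"]    # one element, with its label
my_summary[["Max."]]    # double brackets return just the value
```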

Functions come in packages (libraries)

The code you write will be more or less complex depending on the task you want to accomplish. Families of useful functions are bundled into packages that you can install, load into your R session, and make use of as you work.

Packages save you from reinventing the wheel. They make it so that you do not, for example, have to figure out how to write code from scratch to draw a shape on screen, or load a data file into memory.

Packages are also what allow you to build on the efforts of others in order to do your own work. {ggplot2} is a package of functions.

There are many other such packages and you will make use of several throughout this course, either by loading them with the library() function, or “reaching in” to them and pulling a useful function from them directly.

All of the work you will do this semester will involve choosing the right function or functions, and then giving those functions the right instructions through a series of named arguments.

Most of the mistakes you will make, and the errors you will fix, will involve having not picked the right function, or having not fed the function the right arguments, or having failed to provide information in a form the function can understand.

For now, just remember that you do things in R by creating and manipulating named objects. You manipulate objects by feeding information about them to functions. The functions do something useful with that information (calculate a mean, re-code a variable, fit a model) and give you the results back.

Try these out.

table(my_numbers)
## my_numbers
##  1  2  3  5 25 
##  2  1  2  1  1
sd(my_numbers)
## [1] 8.616153
my_numbers * 5
## [1]   5  10  15   5  15  25 125
my_numbers + 1
## [1]  2  3  4  2  4  6 26
my_numbers + my_numbers
## [1]  2  4  6  2  6 10 50

The first two functions here gave us a simple table of counts and calculated the standard deviation of my_numbers.

It’s worth noticing what R did in the last three cases. First you multiplied my_numbers by five. R interprets that as you asking it to take each element of my_numbers one at a time and multiply it by five. It does the same with the instruction my_numbers + 1. The single value is “recycled” down the length of the vector.

By contrast, in the last case we add my_numbers to itself. Because the two objects being added are the same length, R adds each element in the first vector to the corresponding element in the second vector.
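
Recycling also applies when the shorter vector has more than one element. The sketch below uses two hypothetical vectors (long and short) that are not part of the lesson data.

```r
# Hypothetical vectors of length 4 and length 2
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is reused as 10, 20, 10, 20 to match the length of long
recycled <- long + short
recycled
# [1] 11 22 13 24
```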

Your turn

Create a code chunk to compute the coefficient of variation (standard deviation divided by the mean) for your numbers (my_numbers).

Tuesday, August 30, 2022

Today

  • More graphing examples
  • How R works

If your analysis is to be convincing, the trail from data to final output must be open and available to all. Markdown helps you create scientific reports that are a mixture of text and code. This makes it easy to create an understandable trail from hypothesis, to data, to analysis, to results. This is reproducible research.

Scatter plots

Functions from the {ggplot2} package are used to make graphs. You make these graphing functions available for a given session of R (every time you open RStudio) with the library(ggplot2) function.

As an example, consider the data frame called airquality. The data contains daily air quality measurements from a location in New York City between May and September of 1973.

Follow along by pressing the green arrows when you get to a code chunk.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
dim(airquality)
## [1] 153   6

The data contains 153 rows and 6 columns. Each row is a set of measurements across six variables on a given day.

Most data you will work with are like this. Each row is a set of measurements (a case) and each column is a variable.
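
As a small sketch, you can also ask for the number of cases and variables separately. The object names n_cases, n_vars, and var_names are made up for this illustration.

```r
# airquality ships with R, so no package needs to be loaded
n_cases   <- nrow(airquality)   # number of rows (cases)
n_vars    <- ncol(airquality)   # number of columns (variables)
var_names <- names(airquality)  # the column (variable) names

n_cases
n_vars
var_names
```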

The columns (variables) include the measurements of ozone concentration (Ozone) (ppb), solar radiation (Solar.R) (langley), wind speed (Wind) (mph), temperature (Temp) (°F), as well as Month and Day.

Question: Are ozone concentrations higher on warmer days? Let’s see what the data say.

The scatter plot is one of the most useful statistical graphs. It describes the relationship between two variables. It is made by plotting the variables in a plane defined by the values of the variables.

Using the {ggplot2} functions, you answer the question above by mapping the Temp variable to the x aesthetic and the Ozone variable to the y aesthetic.

More simply you could say that you plot Temp on the x axis and Ozone on the y axis. But you want to recognize that the axes are aesthetics (there are other aesthetics like color, size, etc).

library(ggplot2)

ggplot(data = airquality) + 
  geom_point(mapping =  aes(x = Temp, y = Ozone))
## Warning: Removed 37 rows containing missing values (geom_point).

What do you see? Why the warning?

To suppress the warning, you add the argument na.rm = TRUE in the geom_point() function.

ggplot(data = airquality) + 
  geom_point(mapping =  aes(x = Temp, y = Ozone), 
             na.rm = TRUE)
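
One way to see where the removed rows come from is to count the missing values (NA) in the two plotted columns. This is a sketch using base R only; the object names are hypothetical.

```r
# Count missing values in each plotted variable
missing_ozone <- sum(is.na(airquality$Ozone))
missing_temp  <- sum(is.na(airquality$Temp))

# A point cannot be drawn if either coordinate is missing
missing_either <- sum(is.na(airquality$Ozone) | is.na(airquality$Temp))

missing_ozone
missing_temp
missing_either
```

The last count should match the number of rows reported in the warning.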

To help us better describe the relationship you add another layer. This layer is defined by geom_smooth() which takes the same aesthetics.

ggplot(data = airquality) + 
  geom_point(mapping =  aes(x = Temp, y = Ozone), na.rm = TRUE) +
  geom_smooth(mapping =  aes(x = Temp, y = Ozone), na.rm = TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The smooth line describes how the average ozone concentration varies with temperature. For lower temperatures there is not much change in ozone concentrations as temperatures increase, but for higher temperatures the increase in ozone concentrations is more pronounced.

In the above code you used the same mapping for the point layer and the smooth layer. You can simplify the code by putting the mapping = argument into the ggplot() function.

ggplot(data = airquality,
       mapping =  aes(x = Temp, y = Ozone)) + 
  geom_point(na.rm = TRUE) +
  geom_smooth(na.rm = TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Question: On average is ozone concentration higher on windy days? Create a graph to help you answer this question.

ggplot(data = airquality, 
       mapping = aes(x = Wind, y = Ozone)) + 
  geom_point(na.rm = TRUE) +
  geom_smooth(na.rm = TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What is the answer?

You can use a label instead of a dot for the locations in this two-dimensional scatter plot by adding the label aesthetic and using geom_text.

ggplot(data = airquality, 
       mapping = aes(x = Wind, y = Ozone, label = Ozone)) +
  geom_text(na.rm = TRUE)

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes().

You can make the plot interactive by using the ggplotly() function from the {plotly} package. You simply put the above code inside this function.

plotly::ggplotly(
  ggplot(data = airquality, 
         mapping =  aes(x = Temp, y = Ozone)) + 
  geom_point(na.rm = TRUE) +
  geom_smooth(na.rm = TRUE)
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Hover/zoom etc.

As another example, consider the Palmer penguin data set from https://education.rstudio.com/blog/2020/07/palmerpenguins-cran/.

The data are located on the web at the following URL. You first save the location as an object called loc.

loc <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

Note that this object is now located in your environment. It is simply a string of characters (letters, forward slashes, etc) in quotes. A character object.

Next you get the data and save it as an object called penguins with the read_csv() function from the {readr} package. Inside the parentheses of the function you put the name of the location.

penguins <- readr::read_csv(loc)
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note that the object penguins is now in your environment. It is a data frame containing 344 rows (observations) and 8 variables. You list the first 10 rows and 6 columns by typing the name of the object as follows.

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>

The data are 344 individual penguins, each described by species (Adelie, Chinstrap, Gentoo), where it was found (island name), length of bill (mm), depth of bill (mm), flipper length (mm), body mass (g), sex, and year.

Each penguin belongs to one of three species. To see how many of the 344 penguins are in each species you use the table() function. Between the parentheses of this function you put the name of the data penguins followed by the $ sign followed by the name of the column species.

table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

Said another way, you reference columns in the data with the $ sign so that penguins$species is how you refer to the column species in the data object named penguins.

There are 152 Adelie, 68 Chinstrap, and 124 Gentoo penguins.
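
The same $ pattern works on any data frame. As a sketch that does not need an internet connection, here it is applied to the built-in airquality data frame; the object names are hypothetical.

```r
# Extract one column as a vector with $
months <- airquality$Month

# Double square brackets with the column name return the same vector
same_months <- airquality[["Month"]]

# Count how many daily measurements fall in each month (5 = May, ..., 9 = September)
month_counts <- table(airquality$Month)
month_counts
```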

You plot the relationship between flipper length and body mass for all the penguins.

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()  
## Warning: Removed 2 rows containing missing values (geom_point).

Penguin flipper length and body mass show a positive relationship (association). Penguins with longer flippers tend to be larger.

How does this positive relationship vary by species?

You answer this question with another aesthetic. You assign a level of the aesthetic (here a color) to each unique value of the variable, a process known as scaling. The ggplot() function also adds a legend that explains which levels correspond to which values.

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + 
  geom_point() +
  scale_color_manual(values = c("darkorange","darkorchid","cyan4")) 
## Warning: Removed 2 rows containing missing values (geom_point).

Returning to the mpg data set from last time.

ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point()

The colors reveal that the unusual points (on the right side of the plot) are two-seaters. Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

In the above example, you mapped class to the color aesthetic, but you could have mapped class to the shape aesthetic, which controls point shapes.

ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy, shape = class)) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

What happened to the SUVs? The ggplot() function will only use six shapes at a time. By default, additional groups will go un-plotted when you use the shape aesthetic.

For each aesthetic, you use aes() to associate the name of the aesthetic with a variable to display. The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument.

The syntax highlights a useful insight about x and y: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.

You can also set the aesthetic properties of your geom manually. For example, you can make all of the points in our plot blue.

ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy)) + 
  geom_point(color = "blue")

Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes(). You’ll need to pick a level that makes sense for that aesthetic:

  • The name of a color as a character string (with quotes).
  • The size of a point in millimeters.
  • The shape of a point as a number, as shown below.

R has 26 point shapes that are identified by the numbers 0 through 25. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–25) have a border of color and are filled with fill.
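
To look at the numbered shapes yourself, here is a minimal sketch. It uses base R graphics rather than {ggplot2} (an assumption made to keep the example free of packages), where the pch argument selects the shape, col sets the border or solid color, and bg sets the fill of the filled shapes.

```r
# Shape numbers available through the pch argument in base R graphics
shape_numbers <- 0:25

# Draw one point per shape; blue border/solid color, orange fill
plot(x = shape_numbers, y = rep(1, length(shape_numbers)),
     pch = shape_numbers, col = "blue", bg = "orange",
     cex = 2, yaxt = "n", xlab = "shape number", ylab = "")
```

Only the shapes at the right end of the plot pick up the orange fill.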

Facets

Another way to add a variable, particularly useful for categorical variables, is to split the plot into facets. A facet is a subplot that displays one subset of the data.

A categorical variable is one that can take only a limited, and usually fixed, number of possible values so you can split the plot for each value of the categorical variable.

You can use facet_wrap() to create a faceted plot. The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R). The variable that you pass to facet_wrap() should only have a limited number of values (categorical).

The variable class in the data frame mpg is a character vector. You can see this by typing

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

There are seven car classes. You put class in the facet_wrap() function. Everything is the same as before on the first two code lines but you add the facet_wrap() function.

ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~ class, nrow = 2) 

The output produces separate scatter plots one for each of the seven classes. More on graphs later.

Calculations

Let’s see how you can do some arithmetic in R.

R evaluates commands typed at the prompt and returns the result to the screen. The prompt is the blue greater than symbol (>). To find the sum of the square root of 25 and 2, at the prompt type

sqrt(25) + 2
## [1] 7

The number inside the brackets indexes the output. Here there is only one bit of output, the answer 7. The prompt that follows indicates R is ready for another command.

12/3 - 5
## [1] -1

How would you calculate the 5th power of 2? How would you find the product of 10.3 & -2.9? How would you find the average of 8.3 and 10.2?

How about 4.5% of 12,000?

.045 * 12000 
## [1] 540
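
R follows the usual order of operations: powers first, then multiplication and division, then addition and subtraction. The sketch below uses made-up numbers to show how parentheses change the result.

```r
power_first <- 2 + 3^2      # the power is evaluated first: 2 + 9
times_first <- 2 + 3 * 4    # multiplication before addition: 2 + 12
grouped     <- (2 + 3) * 4  # parentheses force the addition first: 5 * 4

power_first
# [1] 11
times_first
# [1] 14
grouped
# [1] 20
```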

Functions

Many math and statistical functions are available. A function has a name followed by a pair of parentheses. Arguments are placed inside the parentheses as needed.

For example,

sqrt(2)
## [1] 1.414214
sin(pi)
## [1] 1.224647e-16

How do you interpret this output? Type (highlight then click Run): .0000000000000001224647. Why not zero? What does the e-16 mean?
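
A hedged sketch of what is going on: the computer stores pi with finite precision, so sin(pi) is a tiny number rather than exactly zero, and e-16 is scientific notation for "times 10 to the power -16". The object names below are hypothetical.

```r
tiny <- sin(pi)

# Exact comparison fails because of finite precision
exact_zero <- tiny == 0
exact_zero
# [1] FALSE

# all.equal() compares within a small tolerance
near_zero <- isTRUE(all.equal(tiny, 0))
near_zero
# [1] TRUE

# Turn scientific notation off for a small number
format(1.5e-3, scientific = FALSE)
# [1] "0.0015"
```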

exp(1)
## [1] 2.718282
log(10)
## [1] 2.302585

Many functions have arguments with default values. For example, you only need to tell the random number generator rnorm() how many numbers to produce. The default mean is zero. To replace the default value, specify the corresponding argument name.

rnorm(10)
##  [1]  1.01202934  0.15811837  0.49029521 -0.09816279  0.24202958  1.73954980
##  [7]  0.32049056 -1.65061001  1.04496800 -2.20580096
rnorm(10, mean = 5)
##  [1] 4.547750 3.858727 5.942633 4.396579 5.421752 4.869154 4.166234 5.196839
##  [9] 5.835622 5.205494
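
You can inspect the default values of a function's arguments with formals() (or by opening its help page with ?rnorm). The following is a sketch; the seed value and the object names are arbitrary choices for illustration.

```r
# List the arguments of rnorm() together with their defaults
rnorm_defaults <- formals(rnorm)
rnorm_defaults$mean   # default mean
rnorm_defaults$sd     # default standard deviation

# Override both defaults by naming the arguments
set.seed(1)           # arbitrary seed so the draws can be repeated
draws <- rnorm(1000, mean = 5, sd = 2)
mean(draws)           # close to 5, but not exactly 5
```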

Syntax is important

You get an error message when you type a function that R does not understand. For example:

squareroot(2)

Error: could not find function “squareroot”

sqrt 2

Error: syntax error

sqrt(-2)
## Warning in sqrt(-2): NaNs produced
## [1] NaN
sqrt(2

The last command shows what happens if R encounters a line that is not complete. The continuation prompt (+) is printed, indicating you did not finish the command.
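
Errors stop your code while warnings do not. As a sketch, the base R function tryCatch() lets you capture the message of each so you can look at it; the object names are hypothetical.

```r
# Calling a function that does not exist raises an error
error_message <- tryCatch(
  squareroot(2),
  error = function(e) conditionMessage(e)
)
error_message

# Taking the square root of a negative number raises a warning
warning_message <- tryCatch(
  sqrt(-2),
  warning = function(w) conditionMessage(w)
)
warning_message
```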

Saving an object

Use the assignment operator to save an object. You put a name on the left-hand side of the left pointing arrow (<-) and the value on the right. Assignments do not produce output.

x <- 2 
x + 3    
## [1] 5
x <- 10

Here you assigned x to be a numeric object. Assignments are made using the left-pointing arrow (a less-than sign followed by a dash) or an equal sign.

Object names

You are free to make object names out of letters, numbers, and the dot or underscore characters. A name starts with a letter or a dot (a leading dot may not be followed by a number). You can’t use mathematical operators, such as +, -, *, and /, in a name.

Some examples of names include:

x <- 2
n <- 25
a.long.number <- 123456789
ASmallNumber <- .001

Case matters. DF is different than df or Df.
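
A short sketch of both rules, using made-up names. The base R function make.names() shows how R would repair a name that breaks the rules.

```r
# Two different objects: the names differ only in case
my_value <- 1
My_Value <- 2
same_object <- my_value == My_Value
same_object
# [1] FALSE

# make.names() converts invalid names into valid ones
fixed <- make.names(c("2x", "my variable", "ok_name"))
fixed
# [1] "X2x"         "my.variable" "ok_name"
```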

Some names are commonly used to represent certain types of data. For instance, n is for length; x or y are data vectors; and i and j are integers and indices.

These conventions are not forced, but consistent use of them makes it easier for you (and others) to understand what you’ve done.

Entering data

The c() function is useful for getting a small amount of data into R. The function combines (concatenates) items (elements). Example: consider a set of hypothetical annual land falling hurricane counts over a ten-year period.

2 3 0 3 1 0 0 1 2 1

To enter these into your environment, type

counts <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
counts
##  [1] 2 3 0 3 1 0 0 1 2 1

Notice a few things. You assigned the values to an object called counts. The assignment operator is the left-pointing arrow (<-). Values do not print. They are assigned to an object name.

They are printed by typing the object name as you did on the second line. Finally, the values when printed are prefaced with a [1]. This indicates that the object is a vector and the first entry in the vector is a value of 2 (The number immediately to the right of [1]). More on this later.

You can save some typing by using the arrow keys to retrieve previous commands. Each command is stored in a history file and the up arrow key will move backwards through the history file and the down arrow forwards. The left and right arrow keys will work as expected.

Applying a function

Once the data are stored in an object, you use functions on them. R comes with all sorts of functions that you can apply to your counts data.

sum(counts)
## [1] 13
length(counts)
## [1] 10
sum(counts)/length(counts)
## [1] 1.3

For this example, the sum() function returns the total number of hurricanes making landfall. The length() function returns the number of years, and sum(counts)/length(counts) returns the average number of hurricanes per year.

Other useful functions include sort(), min(), max(), range(), diff(), and cumsum(). Try these on the landfall counts. What does range() do? What does diff() do?

Average
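
Before trying these on the landfall counts, here is a sketch on a small hypothetical vector v where the results are easy to check by hand.

```r
v <- c(4, 1, 7, 3)   # hypothetical values

sort(v)     # smallest to largest
# [1] 1 3 4 7
range(v)    # the minimum and the maximum
# [1] 1 7
diff(v)     # each value minus the one before it
# [1] -3  6 -4
cumsum(v)   # running total
# [1]  4  5 12 15
```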

The average (or mean) value of a set of numbers (\(x\)’s) is defined as: \[ \bar x = \frac{x_1 + x_2 + \cdots + x_n}{n} \] The function mean() makes this calculation on your set of counts.

mean(counts)
## [1] 1.3

Data vectors

The count data are stored as a vector. R keeps track of the order in which the data were entered: first element, second element, and so on. This is good for a couple of reasons. Here the data have a natural order (year 1, year 2, etc.) that you don’t want to mix up. You would also like to be able to make changes to the data item by item instead of entering the entire data again. Finally, vectors are math objects, making them easy to manipulate.

Suppose counts contain the annual number of land-falling hurricanes from the first decade of a longer record. You want to keep track of counts over other decades. This could be done as in the following example.

cD1 <- counts
cD2 <- c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1) 

Note that you make a copy of the first decade of counts and save the vector using a different object name.

Most functions operate on each element of the data vector at the same time.

cD1 + cD2
##  [1] 2 8 4 5 4 0 3 4 4 2

The first year of the first decade is added to the first year of the second decade and so on.

What happens if you apply the c() function to these two vectors?

c(cD1, cD2)
##  [1] 2 3 0 3 1 0 0 1 2 1 0 5 4 2 3 0 3 3 2 1

If you are interested in each year’s count as a difference from the decade mean, you type:

cD1 - mean(cD1)
##  [1]  0.7  1.7 -1.3  1.7 -0.3 -1.3 -1.3 -0.3  0.7 -0.3

In this case a single number (the mean of the first decade) is subtracted from each element of the vector of counts.

This is an example of data recycling. R repeats values from one vector so that its length matches the other vector. Here the mean is repeated 10 times.

Variance

Suppose you are interested in the variance of the set of landfall counts. The formula is given by: \[ \hbox{var}(x) = \frac{(x_1 - \bar x)^2 + (x_2 - \bar x)^2 + \cdots + (x_n - \bar x)^2}{n-1} \]

Note: The formula is given as LaTeX math code with the double dollar signs starting (and ending) the math mode. It’s a bit hard to read but it translates exactly to math as you would read it in a scientific article or textbook. Look at the HTML file.

Although the var() function will compute this for you, here you see how you could do this directly using the vectorization of functions. The key is to find the squared differences and then add up the values.

x <- cD1
xbar <- mean(x)
x - xbar
##  [1]  0.7  1.7 -1.3  1.7 -0.3 -1.3 -1.3 -0.3  0.7 -0.3
(x - xbar)^2
##  [1] 0.49 2.89 1.69 2.89 0.09 1.69 1.69 0.09 0.49 0.09
sum((x - xbar)^2)
## [1] 12.1
n <- length(x)
n
## [1] 10
sum((x - xbar)^2)/(n - 1)
## [1] 1.344444

To verify type

var(x)
## [1] 1.344444

Data vectors have a type

One restriction on data vectors is that all the values have the same type. This can be numeric, as in counts, or character strings, as in

simpsons <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")
simpsons
## [1] "Homer"  "Marge"  "Bart"   "Lisa"   "Maggie"

Note that character strings are made with matching quotes, either double, ", or single, '.

If you mix the type within a data vector, the data will be coerced into a common type, which is usually a character. Arithmetic operations do not work on characters.
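
A minimal sketch of coercion, using hypothetical vectors. Mixing numbers, text, and logical values gives a character vector; mixing numbers and logical values gives a numeric vector.

```r
# Numbers, text, and a logical value: everything becomes character
mixed <- c(1, "a", TRUE)
mixed
# [1] "1"    "a"    "TRUE"
class(mixed)
# [1] "character"

# Numbers and logical values: TRUE becomes 1 and FALSE becomes 0
num_logical <- c(1, TRUE, FALSE)
num_logical
# [1] 1 1 0

# Convert a character back to a number before doing arithmetic
four <- as.numeric("3") + 1
four
# [1] 4
```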

Returning to the land falling hurricane counts.

cD1 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)   
cD2 <- c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)

Now suppose the National Hurricane Center (NHC) reanalyzes a storm, and that the 6th year of the 2nd decade is a 1 rather than a 0 for the number of landfalls. In this case you type

cD2[6] <- 1    # assign the 6th year of the decade a value of 1 landfall

The assignment to the 6th entry in the vector cD2 is done by referencing the 6th entry of the vector with square brackets [].

It’s important to keep this in mind: Parentheses () are used for functions and square brackets [] are used to extract values from vectors (and arrays, lists, etc). REPEAT: [] are used to extract or subset values from vectors, data frames, matrices, etc.

cD2    #print out the values
##  [1] 0 5 4 2 3 1 3 3 2 1
cD2[2]  # print the number of landfalls during year 2 of the second decade
## [1] 5
cD2[4]  # 4th year count
## [1] 2
cD2[-4]  # all but the 4th year
## [1] 0 5 4 3 1 3 3 2 1
cD2[c(1, 3, 5, 7, 9)]   # print the counts from the odd years
## [1] 0 4 3 3 2
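
Square brackets also accept a logical condition, which is handy when you do not know the positions in advance. This sketch re-creates the corrected second-decade counts so it runs on its own; busy_counts and busy_years are hypothetical names.

```r
# Second-decade counts after the correction to year 6
cD2 <- c(0, 5, 4, 2, 3, 1, 3, 3, 2, 1)

# Keep only the counts from years with more than two landfalls
busy_counts <- cD2[cD2 > 2]
busy_counts
# [1] 5 4 3 3 3

# which() returns the positions (years) where the condition is TRUE
busy_years <- which(cD2 > 2)
busy_years
# [1] 2 3 5 7 8
```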

One way to remember how to use functions is to think of them as pets. They don’t come unless they are called by name (spelled properly). They have a mouth (parentheses) that likes to be fed (arguments), and they will complain if they are not fed properly.

Working smarter

R’s console keeps a history of your commands. The previous commands are accessed using the up and down arrow keys. Repeatedly pushing the up arrow will scroll backward through the history so you can reuse previous commands.

Many times you wish to change only a small part of a previous command, such as when a typo is made. With the arrow keys you can access the previous command then edit it as desired.

Thursday, September 1, 2022

Today

  • Data as vectors
  • Sample statistics
  • Structured data
  • Tables and summaries

Data as vectors

The c() function is used to get small amounts of data into R. The function combines (concatenates) items (elements). Example: consider a set of hypothetical annual land falling hurricane counts over a ten-year period.

2 3 0 3 1 0 0 1 2 1

To save these values in our environment as a data object, type

counts <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
counts
##  [1] 2 3 0 3 1 0 0 1 2 1

Once data are stored as an object, you use functions on them. Some common functions used on simple data objects include

sum(counts)
## [1] 13
length(counts)
## [1] 10
sum(counts)/length(counts)
## [1] 1.3

For this example, the sum() function returns the total number of hurricanes making landfall. The length() function returns the number of years, and sum(counts)/length(counts) returns the average number of hurricanes per year.

Mean

The average (or mean) value of a set of numbers (\(x\)’s) is defined as: \[ \bar x = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

Note: The formula is given as LaTeX math code with the double dollar signs starting (and ending) the math mode. It’s a bit hard to read but it translates exactly to math as you would read in a scientific article or textbook.

The function mean() makes this calculation on your set of counts.

mean(counts)
## [1] 1.3

The counts data is stored as a vector. R keeps track of the order that the data were entered. First element, second element, and so on. This is good for a couple of reasons. Here the data have a natural order - year 1, year 2, etc. You don’t want to mix these. You would like to be able to make changes to the data item by item instead of entering the entire data again. Also vectors are math objects making them easy to manipulate.

Suppose counts contain the annual number of land-falling hurricanes from the first decade of a longer record. You want to keep track of counts over other decades. This could be done as in the following example.

cD1 <- counts
cD2 <- c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)

Note that you make a duplicate copy of the vector called counts giving it a different name.

Most functions operate on each element of the data vector at the same time.

cD1 + cD2
##  [1] 2 8 4 5 4 0 3 4 4 2

The first year of the first decade is added to the first year of the second decade and so on.

What happens if you apply the c() function to these two vectors?

c(cD1, cD2)
##  [1] 2 3 0 3 1 0 0 1 2 1 0 5 4 2 3 0 3 3 2 1

If you are interested in each year’s count as a difference from the decade mean, you type:

cD1 - mean(cD1)
##  [1]  0.7  1.7 -1.3  1.7 -0.3 -1.3 -1.3 -0.3  0.7 -0.3

In this case a single number (the mean of the first decade) is subtracted from each element of the vector of counts.

This is an example of data recycling. R repeats values from one vector so that the length of this vector matches the other, longer vector. Here the mean is repeated 10 times.
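
Recycling works cleanly when the longer length is a multiple of the shorter length. Otherwise R still returns a result but issues a warning. The sketch below captures that warning with tryCatch(); the vectors are hypothetical.

```r
three <- c(1, 2, 3)
two   <- c(10, 20)

# Length 3 is not a multiple of length 2, so R warns
recycle_warning <- tryCatch(
  three + two,
  warning = function(w) conditionMessage(w)
)
recycle_warning

# The result is still computed: 10 is reused for the third element
partial <- suppressWarnings(three + two)
partial
# [1] 11 22 13
```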

Variance

Suppose you are interested in how much the set of annual landfall counts varies from year to year. The formula for the variance is given by: \[ \hbox{var}(x) = \frac{(x_1 - \bar x)^2 + (x_2 - \bar x)^2 + \cdots + (x_n - \bar x)^2}{n-1} \]

Although the var() function will compute this, here you see how it can be computed from other simpler functions. The first step is to find the squared difference between each value and the mean. To simplify things first create a new vector x and assign the mean of the x’s to xbar.

x <- cD1
xbar <- mean(x)
x - xbar
##  [1]  0.7  1.7 -1.3  1.7 -0.3 -1.3 -1.3 -0.3  0.7 -0.3
(x - xbar)^2
##  [1] 0.49 2.89 1.69 2.89 0.09 1.69 1.69 0.09 0.49 0.09

The sum of the differences is zero, but not the sum of the squared differences.

sum((x - xbar)^2)
## [1] 12.1
n <- length(x)
n
## [1] 10
sum((x - xbar)^2)/(n - 1)
## [1] 1.344444

So the variance is 1.344. To verify with the var() function type

var(x)
## [1] 1.344444
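
The variance is in squared units (landfalls squared). Taking the square root gives the standard deviation, which is in the same units as the data. As a sketch, the counts are re-entered here so the code runs on its own.

```r
# First-decade landfall counts
x <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)

# Standard deviation as the square root of the variance
sd_by_hand <- sqrt(var(x))

# The built-in function gives the same value
sd_builtin <- sd(x)

sd_by_hand
sd_builtin
```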

Median

Recall that the mean is a statistic calculated on our data. Typically there are more data values close to the mean than far from it. A normal random variable is within two standard deviations of its mean about 95% of the time.

The median is a statistic defined exactly as the middle value.

For example, consider a set of seven data values. Here the seven values are generated randomly. The set.seed() function guarantees that everyone (with a particular seed number) will get the same set of values.

set.seed(3043)

y <- rnorm(n = 7)
sort(y)
## [1] -1.855028975 -1.536523195 -1.113848013 -0.863720993 -0.813241685
## [6]  0.002064746  1.024752099
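
To see what set.seed() does, here is a minimal sketch with an arbitrary seed (42 is an assumption, any number works). Resetting the seed before each call makes the random draws repeat.

```r
set.seed(42)          # arbitrary seed
first_draw <- rnorm(3)

set.seed(42)          # same seed again
second_draw <- rnorm(3)

# The two sets of "random" values are the same
identical(first_draw, second_draw)
# [1] TRUE
```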

The argument value n = 7 guarantees seven values. They are sorted from lowest on the left to highest on the right with the sort() function. The middle value is the fourth value from the left in the ordered list of data values.

median(y)
## [1] -0.863721

The median divides the data set into the top half (50%) of the data values and the bottom half of the data values.

With an odd number of values, the median is the middle one; with an even number of values, the median is the average of the two middle values.

y <- rnorm(n = 8)
sort(y)
## [1] -2.03716871 -1.32753574 -0.74852359 -0.62357212  0.07656504  0.50029011
## [7]  1.38629034  1.42971671
median(y)
## [1] -0.2735035

You check to see this is true no matter what the values are or what even number of values you choose.

N = 20
y <- rnorm(n = N)
y_sorted <- sort(y)
median(y) == (y_sorted[N/2] + y_sorted[N/2 + 1]) / 2
## [1] TRUE
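
The odd and even rules can be combined into one small function. This is a sketch for illustration only (my_median is a hypothetical name, and it does not handle missing values); in practice you use the built-in median().

```r
my_median <- function(v) {
  v_sorted <- sort(v)
  n <- length(v_sorted)
  if (n %% 2 == 1) {
    # odd number of values: take the middle one
    v_sorted[(n + 1) / 2]
  } else {
    # even number of values: average the two middle ones
    (v_sorted[n / 2] + v_sorted[n / 2 + 1]) / 2
  }
}

my_median(c(3, 1, 2))
# [1] 2
my_median(c(4, 1, 3, 2))
# [1] 2.5
```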

The median value, as a statistic representing the middle of a set of data values, is said to be resistant to extreme values (outliers).

Consider the wealth (in 1000s of $) of five bar patrons.

patrons <- c(50, 60, 100, 75, 200)

Now consider the same bar and patrons after a multimillionaire walks in.

patrons_with_mm <- c(patrons, 50000)
mean(patrons)
## [1] 97
mean(patrons_with_mm)
## [1] 8414.167
median(patrons)
## [1] 75
median(patrons_with_mm)
## [1] 87.5

The difference in the mean wealth with and without the multimillionaire present is substantial while the difference in median wealth with and without the multimillionaire is small.

Statistics that are not greatly influenced by a few values far from the bulk of the data are called resistant.
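
One way to put numbers on this is to compute how many times larger each statistic becomes after the new patron arrives. This is a sketch; the ratio objects are hypothetical names, and the wealth values are re-entered so the code runs on its own.

```r
patrons <- c(50, 60, 100, 75, 200)
patrons_with_mm <- c(patrons, 50000)

# Ratio of the statistic after versus before the arrival
mean_ratio   <- mean(patrons_with_mm) / mean(patrons)
median_ratio <- median(patrons_with_mm) / median(patrons)

mean_ratio     # the mean grows by a factor of roughly 87
median_ratio   # the median grows by a factor of less than 1.2
```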

The cfb data set from the {UsingR} package has data from the Survey of Consumer Finances conducted by the U.S. Federal Reserve Board (in 2001). Some of the income values are much higher than the bulk of the data. This tendency is common in income distributions. A few people tend to accumulate enormous wealth.

Make the data available with the library() function, then print the data frame by typing the name of the data object (cfb).

library(UsingR)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
## 
##     cancer
cfb
##                  WGT AGE EDUC       INCOME CHECKING SAVING     NMMF  STOCKS
## X17470     5749.9746  54   14   66814.1946     6000   2000        0     500
## X315       5870.6340  40   12   42144.3381      400      0        0       0
## X8795      8043.6950  35   14   25697.7671     1000    160        0       0
## X10720     6092.8720  55   12   35976.8740     2600  19100        0       0
## X19170     7161.7566  40   12   39060.6061     1000   8300        0    3500
## X22075    11429.6335  82   12   13362.8389     1000      0    50000       0
## X12235     5988.0417  26   16   61674.6411     3000      0        0       0
## X7670      7111.7751  50   14   53451.3557     3100      0        0       0
## X16555     7602.8631  71   12   16446.5710     1000      0        0       0
## X370       9917.0148  70    6    9867.9426       50      0        0       0
## X7680      7263.7921  52   12   35976.8740     1700   3000     2000       0
## X6880      7039.9174  53   11    7195.3748        0      0        0       0
## X16570     6523.7932  27   16   78121.2121     8500   8000     1100    1500
## X12945     6490.4551  27   12   28781.4992        0      0     4000       0
## X6725      8265.3192  69   12   12334.9282      500   1600        0       0
## X15725     1616.6743  55   17  459476.0766        0      0        0       0
## X19880     6805.1027  42   14   54479.2663     3200     55        0       0
## X225       6865.3880  73   12   43172.2488    11000   2000   100000       0
## X4995      7731.3206  76   12   69897.9266   296440      0        0       0
## X7700      5693.9061  43   12   58590.9091      750   1700    13000       0
## X11375     6660.1557  48   11   52423.4450        0   1200    17600       0
## X17920     6764.6424  57   12   25697.7671     1600    900        0       0
## X12365     5591.7642  44   16   51395.5343      590   1780        0   14000
## X920       5812.9110  44   15   87372.4083      300  32700        0       0
## X19050     1022.1029  59   13   59618.8198     1500      0        0   75000
## X19555     8909.1588  47   14   25697.7671    17320    730        0       0
## X10520     4336.5281  25   14   26725.6778      800   1500        0       0
## X18705     8691.5555  28   16   71953.7480     4020  11830        0       0
## X5095      7620.1135  74   12   48311.8022     3500      0        0       0
## X11010     7683.5398  62   11    6475.8373        0    250        0       0
## X3540     10144.6672  23   12   28781.4992      420    340        0       0
## X14950     7328.9577  40   14   71953.7480    20800      0        0       0
## X4830      7069.5583  44   13    3700.4785      350   1000        0       0
## X2865     10911.3427  65   11   26725.6778     7000   6000     7500   22000
## X20945     6415.1554  35   17   54479.2663     1200   9310        0       0
## X13040     5263.6488  40   13   66814.1946        0    380        0       0
## X4515      5360.7266  33   11   28781.4992      500      0        0       0
## X145       5696.8902  21   11     513.9553       20     20        0       0
## X18685     8417.3121  63   13   41116.4274      180      0        0       0
## X17585     6373.6917  52   17   57562.9984     1000      0        0       0
## X10090     5114.4060  24   14   28781.4992        0      0        0       0
## X13235     5454.0787  29   14    9251.1962     2000  20000        0       0
## X3045      5454.0787  46   14   92511.9617     2500   1500    88000       0
## X21425     5696.2367  38   12   11307.0175       50     40        0       0
## X11840     5361.2218  34   13    7400.9569        0      0        0       0
## X3400      6327.2872  47   12   30837.3206     3500      0        0       0
## X6635      7173.3284  49   14   37004.7847     3200   2300        0       0
## X19815     6188.2375  83   12   25697.7671     5100   1800        0       0
## X19565     5788.8378  50   16   25697.7671     1500    350        0       0
## X12135     7998.0705  68   16  104846.8899    14600   7100        0   35000
## X10700     6501.2709  83   14   58590.9091        0      0   330000  275000
## X2600      7956.8927  28   17   61674.6411     6000      0    20000       0
## X2860      6604.7905  45   12   20558.2137     1500    530        0       0
## X2175      4522.0593  57   11   33921.0526      600   3000        0       0
## X14915     9185.1147  50   12  169605.2632     3500      0        0       0
## X66351     7173.3284  49   14   37004.7847     3200   2300        0       0
## X6575      1688.0257  40   14  153158.6922     2000      0     6000       0
## X8410      6793.3807  29   13   15418.6603     3500   3000        0       0
## X7230      5859.0521  59   15   15418.6603     2000   1850   200000       0
## X12955    10373.1531  69    6   12334.9282     1000      0        0       0
## X19205     7691.5051  44   12   53451.3557     1000   1050        0       0
## X600       5976.9863  59   12   38032.6954      990   1200        0       0
## X1290      6655.6238  22   14   15418.6603      250      0        0       0
## X17070      243.6350  40   16  925119.6172    22000      0   275000  175000
## X16140     6677.0208  68   14   44200.1595    22000  50000        0       0
## X17935     9636.8011  52   16   81204.9442        0      0        0       0
## X3605      5198.3414  39   16   47283.8915     4000   3000    25000       0
## X10275     5933.3748  43   16  144935.4067     7700  15300    17000   17000
## X19930     7944.1474  28   13   75037.4801     3200     50        0     500
## X15360     7421.8016  40   12  123349.2823     1000   6000    50000   20000
## X1075      7485.5250  77   12   12334.9282     1700  12390    43000       0
## X7770      9527.8477  78   16   34948.9633     2000    100        0       0
## X1010      6341.0975  58    8   12334.9282     1500   1200        0       0
## X7095      4293.7517  37   14   14390.7496      660      0        0       0
## X14255     7427.1703  78   12   11307.0175     2200      0        0       0
## X20075    10164.9687  56   14   37004.7847     1000      0        0       0
## X2610      5551.9820  28   12   28781.4992      270    300        0       0
## X965       5837.2792  83   12   14390.7496      600  20000        0       0
## X17515     6220.5890  48   16   41116.4274     2000   5000        0       0
## X1755      6270.0639  24    9    8223.2855        0      0        0       0
## X16440    11386.7530  57   13  113070.1754        0   3500        0    2000
## X14750     7029.1679  29   16   37004.7847      400    100        0       0
## X16960     8067.4672  36   17   63730.4625     2000      0    40000   20000
## X575       5111.3136  24   11   24669.8565        0      0        0       0
## X12340     7216.5318  79   16   25697.7671     7500      0    28000  190000
## X3250      5516.1522  40   17   12334.9282      300      0        0       0
## X21805     3597.7161  51   16   81204.9442     5000      0    50000   30000
## X17860     2751.6615  49   12    8223.2855    14600    500        0       0
## X6260      3036.6357  44   14  177828.5486        0      0    50000   85000
## X8435      4689.7790  25   15   40088.5167      160      0        0       0
## X10795     6313.4185  55   17   87372.4083     1500  96000    27000       0
## X9785      6018.8547  48    6   35976.8740     2850      0        0       0
## X17455     8340.9656  40   14   46255.9809     4000      0        0       0
## X11275    10483.6685  68   12   27753.5885     3300   8000    32000  116000
## X6785      7596.6439  45   12   56535.0877      500    505        0       0
## X12920     6468.5210  25   14    4625.5981        0      0        0       0
## X12685     6937.5423  50   17   50367.6236     1800   1850        0       0
## X7575      5875.4599  67   14   66814.1946      200      0        0       0
## X16745     8034.5602  30   14   52423.4450     2000      0        0       0
## X3925      6698.1550  28   12   37004.7847        0      0        0       0
## X13715     7485.8803  21   15   11307.0175      800    340        0       0
## X2630      6623.7739  31   16  149047.0494     2000   2300        0    1700
## X1880      7673.4807  42   17   12334.9282      220      0        0       0
## X16810     5375.4516  29   14   38032.6954       20      0        0       0
## X7535      5532.8460  23   14   25697.7671     1200      1        0       0
## X17395     4448.8961  36   13   31865.2313        0      0        0       0
## X20265     4733.4575  40   16   28781.4992      820    400        0   18000
## X16645     6010.7120  58   13  113070.1754     6030  10000        0       0
## X18180     4583.3587  52   12  117181.8182     5400  21000   250000       0
## X4825      5070.4577  38   13   33921.0526        0      0        0       0
## X1845      8154.7752  78   16   64758.3732     3500   1500        0   26000
## X5425     10038.8263  40   12  113070.1754     3000   1500        0    8300
## X10600     8502.3051  68   14   61674.6411     4000  36000        0       0
## X10360     8298.7768  68   11   10279.1069        0      0        0       0
## X19890     4456.3079  27   13   16446.5710      750   1550        0   17000
## X20500     8349.2691  26   10   19530.3030        0      0        0       0
## X2565      6641.8552  42   16   88400.3190     2000  44000   100000   20000
## X26002     7956.8927  28   17   61674.6411     6000      0    20000       0
## X19845     4405.0395  26   12   38032.6954      140      0        0       0
## X18965     8152.5724  59   12   43172.2488     9500   7700        0       0
## X11230     3934.7121  28   11   15418.6603        0      0        0       0
## X11260     7423.0858  75   14   67842.1053     3500   3400    45000  122000
## X3200      7098.8499  42   12   98679.4258     1600  13040     1000     600
## X5965      5871.1832  37   16   41116.4274      470    600     1300       0
## X107953    6313.4185  55   17   87372.4083     1500  96000    27000       0
## X11035     9078.7938  85   12   20558.2137        0      0        0       0
## X18245     5659.8661  54   17   91484.0510     7000   8200        0       0
## X11955     7244.9139  46   12   16446.5710    10000  10000        0       0
## X9345      6726.9283  92    9    3392.1053       80      0        0       0
## X2320      6434.5102  49   14   71953.7480     1300   1600        0       0
## X9295      1158.4185  71   16   65786.2839     3000    400        0       0
## X20110     5731.2341  48   14   51395.5343     5900  11000        0       0
## X680       6833.6584  48   16  100735.2472     3800      0    13000       0
## X13270     7537.6703  37   17   82232.8549     3200   1950        0       0
## X3075      7190.2136  42   17   40088.5167      200   2150        0       0
## X13160     9388.0984  42   17   61674.6411    15000    100        0       0
## X20435     3133.2430  35   17 1182097.2887        0      0        0  375000
## X12465     2146.5932  52   16   51395.5343    18000  40000        0  600000
## X4440      4599.0191  60   12    8223.2855      660    100        0       0
## X3870      7560.2604  63   12   29809.4099     1500      0        0       0
## X3510      6655.2299  40   12   80177.0335     1000      0        0       0
## X13795     6664.1853  18   10    7812.1212        0      0        0       0
## X18155     4538.6282  50   16   61674.6411      500    500        0       0
## X4685      7123.2132  57   16   51395.5343     1000  30000        0    5700
## X20135     4921.4820  44   12   29809.4099      470    300        0       0
## X7975     10857.6915  77    8   15418.6603        0  40000        0       0
## X16425     6688.8349  53   11   82232.8549     2000    300        0       0
## X84354     4689.7790  25   15   40088.5167      160      0        0       0
## X12905     7233.3450  62    2   71953.7480     1000    500        0    2500
## X15095     7819.0561  86   14    7195.3748     1010 132000        0       0
## X3625      7581.8314  34   11    6989.7927        0      0        0       0
## X198455    4405.0395  26   12   38032.6954      140      0        0       0
## X570      10431.8465  47   12   45228.0702      500      1        0       0
## X21195     6578.5191  74   16   38032.6954     5200      0      190   50000
## X16470     3597.7161  43   12 1408237.6396       10      0        0       0
## X14880     5711.2392  52   12   94567.7831     2150   2610        0       0
## X9485      8780.1580  55   10    1439.0750        5    400        0       0
## X17090     5797.9275  33   12   10279.1069      700      0        0       0
## X9670     11386.7530  45   16   92511.9617     5500      0        0       0
## X15945     4792.5122  44   12   10279.1069        0      0        0       0
## X13535     5532.8460  23   10   29809.4099      200    200        0       0
## X3685      7486.2704  48   10   16446.5710        0      0        0       0
## X540       6746.5369  45   12   40088.5167      750    180        0       0
## X17780     6655.8875  51   16   71953.7480      500   8000        0     400
## X21100     3253.9699  49   14   90456.1404        0  10000   100000  300000
## X4310      9939.8329  34   17   47283.8915     1500      0        0       0
## X2010      8301.2131  49   12   38032.6954      500    700        0       0
## X8785      6388.3726  55   12   48311.8022     2000      0        0       0
## X1045      7700.3724  56    9   34948.9633      500   5000        0    3000
## X2935      8045.5847  76   12   24669.8565     1500      0        0       0
## X11195     7192.3659  21   12   46255.9809        0      0        0  120000
## X110356    9078.7938  85   12   20558.2137        0      0        0       0
## X3410      6611.1226  50   16   67842.1053       10    700        0       1
## X17765     6235.1707  83   15    7812.1212     2000      0        0       0
## X9175      3265.4868  46   16   71953.7480     1000   1200    10000       0
## X6395      5644.9880  28   12   26725.6778        0    300        0    5700
## X485       5154.0603  49   16   19530.3030        0      0        0       0
## X870       1173.9354  40   16   80177.0335     1640   4100        0    1700
## X9220      4897.5131  37    9   12334.9282       40      0        0       0
## X1920      7487.6105  63   12   12334.9282      400      0        0       0
## X19230     8742.7099  63   15   27753.5885     1850      0    80000   75000
## X18475     2133.9750  67   17   53451.3557     4000      0   421000  375000
## X5895      5446.1083  45    7   17474.4817        0    520        0       0
## X3695     10109.2136  49   17   76065.3908     2000    500        0       0
## X17075     7726.8088  31   17   63730.4625        0   5000        0       0
## X21685     6899.7143  37   13   21586.1244      800      0        0       0
## X10410     5134.3240  25   16   50367.6236        0    300        0       0
## X1350      5540.6097  31   13   12334.9282        0      0        0       0
## X18760     5988.3062  43   17   71953.7480      500      0        0       0
## X3405      5303.8926  27   16   26725.6778      510     15        0       0
## X12035     5803.8741  35   12   46255.9809      770      0        0       0
## X305       6313.9774  47   14   68870.0159     2000  16100     5500       0
## X17850     7666.5600  72   12  185023.9234     3500      0        0       0
## X4110      1503.1836  38   16 1541866.0287        0      0  1530000  300000
## X4605      6478.4991  62   12   19530.3030     1800  15000        0       0
## X12555     4686.2076  25   11   26725.6778        0      0        0       0
## X5915      3330.3623  54   17  332015.1515    15000  23500   125000       0
## X22035     4823.1376  58    1    6578.6284        0      0        0       0
## X6930      5808.7163  31   12   58590.9091    12000  16500        0    5500
## X17060    10597.7984  80   10   23641.9458     4800      0        0       0
## X13760     6133.1493  57   12   53451.3557       50  18700        0       0
## X5825      6661.3144  56   16   31865.2313     1300      0        0       0
## X34057     5303.8926  27   16   26725.6778      510     15        0       0
## X20180     8410.7240  61   15  101763.1579      450      0        0       0
## X21130    11097.5342  78   12   15418.6603      310      0        0       0
## X12205     4681.8403  46   14   12334.9282        0      0        0       0
## X1265      9929.1222  77   12   25697.7671        0      0        0       0
## X13645    10246.9474  81   12   44200.1595    51000      0        0       0
## X905       7456.1503  23   11   12334.9282      700      0        0       0
## X21995     5929.3158  83   12   10279.1069      300   6000        0       0
## X6975      9338.9337  78   16   24669.8565        0   1100        0       0
## X16450     5872.4153  40   12   22614.0351     2500    120        0       0
## X14840     5671.0347  80   13   15418.6603     3700   3300        0       0
## X8300      6136.6248  46   14   52423.4450     1700   6150        0       0
## X645       2797.2649  52   17  192219.2982     2000   2000        0       0
## X2770      7022.5454  62   16   47283.8915     2700   2000        0       0
## X147508    7029.1679  29   16   37004.7847      400    100        0       0
## X1540      6385.0040  35   11   40088.5167      320   3240        0       0
## X19435     5019.3357  27   11    5961.8820        0      0        0       0
## X6765      9419.1196  72   16   30837.3206      400      0        0       0
## X54259    10038.8263  40   12  113070.1754     3000   1500        0    8300
## X19980     7630.0979  86   17   37004.7847    10000  20000        0       0
## X54010     6746.5369  45   12   40088.5167      750    180        0       0
## X21890     6316.2726  39   12   87372.4083     1000  15650        0       0
## X1220      8765.8772  76    8   23641.9458        0  18000        0       0
## X16615      837.3098  46   16  153158.6922     5000      0        0  750000
## X16905    11386.7530  76   16   28781.4992     5600   6800    48000  100000
## X9050      1101.0772  46   17  223056.6188        0      0    80000   14000
## X21165     5386.4622  40   14   71953.7480        0 110000   135000       0
## X16350     5073.3726  26   11    5653.5088        0      0        0       0
## X14085     5169.3498  56   14   35976.8740      200      0        0       0
## X11465     5134.4672  54   12   68870.0159      400    900        0       0
## X12610     1725.2995  60   14   20558.2137    20000      0        0       0
## X785       5496.3173  24   16   30837.3206     1000      0        0       0
## X14485     6354.0137  45   13   17474.4817     1500    360        0       0
## X8580      7333.3380  40   12   39060.6061     1000      0        0       0
## X10340     6355.6933  25   14   67842.1053      500      0        0       0
## X20855     5483.6654  75    8    6270.2552      200    450        0       0
## X5420      7143.7855  43   15   58590.9091       50   1960     1500       0
## X1200      7770.1210  49   12   66814.1946     1200   4900        0       0
## X13395     6239.6906  29   11  125405.1037     2000   7000        0    6000
## X10230     7426.1415  49   14   30837.3206        0  10000        0       0
## X17945    10038.8263  39   17   37004.7847      700    580        0       0
## X565       6382.7943  47   12   10279.1069      300      0        0       0
## X18070     7659.5207  88    7   16446.5710      900   2700        0    2000
## X509511    7620.1135  74   12   48311.8022     3500      0        0       0
## X8940      5120.7298  43   12   54479.2663     3500  15000        0       0
## X11575     9604.9903  66   17   43172.2488     1400      0        0    3600
## X1213512   7998.0705  68   16  104846.8899    14600   7100        0   35000
## X14770     8130.4052  44   14   96623.6045     1800   3800        0       0
## X22015     6105.5579  54   14   14390.7496     1000   1300        0       0
## X4965      5025.3417  49   10       0.0000        0      0        0       0
## X1660      8149.6942  44   13   62702.5518     2500   6000        0       0
## X20795     6880.3630  19   13   19530.3030      470     40        0    4000
## X2045      7719.5388  21   13   35976.8740     2000      0        0       0
## X10235     9916.6980  85   16   24669.8565        1      0        0  180000
## X12060     7335.8692  23   12   22614.0351      100    120        0       0
## X5680      8288.4412  57   14  129516.7464     4900   5000        0       0
## X20215     5133.1927  52   12   19530.3030     2250    410        0       0
## X15375     7898.4771  80    6   17474.4817        0   2500        0       0
## X10740     9507.0043  78    5    7709.3301        0      0        0       0
## X4160      4372.7256  68   16   76065.3908     7000      0   163000  112000
## X310       5950.2488  40   12   21586.1244     2000    500        0       0
## X3235      6509.0382  57   12   51395.5343      700   2200        0       0
## X21055     8250.0749  67   12   12334.9282    10220      0        0       0
## X2620      5284.0466  28   12       0.0000        0      0        0       0
## X1600      4660.6242  37   17   75037.4801     2500    220        0       0
## X1751513   6220.5890  48   16   41116.4274     2000   5000        0       0
## X5765      6225.7422  59   14   41116.4274        0      0        0       0
## X16945     6440.1730  79   12   39060.6061    27000  15000        0       0
## X20830      236.7943  57   17  429666.6667        0      0   150000       0
## X10105    10483.6685  65   17   35976.8740        0      0        0    9000
## X4895      8641.0258  36   12   30837.3206      400    700        0       0
## X9895      4920.1955  55   14   14390.7496        0      0        0       0
## X10650     7902.0620  48   12   19530.3030      100      0        0       0
## X8705      6661.6101  54   12   85316.5869     2500    500        0       0
## X1490      7291.4425  85   12   82232.8549    16000      0        0       0
## X341014    6611.1226  50   16   67842.1053       10    700        0       1
## X1408515   5169.3498  56   14   35976.8740      200      0        0       0
## X16235     7640.4959  28   13  106902.7113     1200   2420        0       0
## X2201516   6105.5579  54   14   14390.7496     1000   1300        0       0
## X17115     6487.3485  43   15   57562.9984     2500   4600        0    1000
## X22110     8414.4992  38   12   56535.0877     1500   1750        0       0
## X5075      8472.5536  78   11   75037.4801     8600    950        0       0
## X3895      6208.4334  54   16   83260.7655    16300      0    62700   56000
## X18550     6467.8824  41   14  152130.7815      150      0        0       0
## X1998017   7630.0979  86   17   37004.7847    10000  20000        0       0
## X10815     5992.9396  56   12   28781.4992      900    560        0       0
## X130       6832.8261  50   17   64758.3732     1100  14700        0       0
## X15700     9210.3635  42   17  115125.9968     5600   1900        0       0
## X10560     7091.0393  78    6    8223.2855      660      0        0       0
## X8180      7339.2602  55   12   15418.6603        0      0        0       0
## X6115      7434.1715  55   13   61674.6411     2000   1000        0       0
## X11495     9240.9040  44   16   81204.9442     5700   2200        0    8800
## X17710     8507.0410  77    9   23641.9458    13000      0    10000   80000
## X10510     7534.6363  19   12    5653.5088        0      0        0       0
## X10990     5651.0074  45   11   35976.8740     2400      0        0    1200
## X13300     5711.5575  59   12   16446.5710      100     70        0       0
## X19315     6143.5717  37   12   25697.7671     1600      5        0       0
## X10685     7444.5161  45   12   23641.9458        0      0        0       0
## X19330     6984.1467  48   14   71953.7480     2000   5600        0   15000
## X16260     6003.7896  46   10    6167.4641        0     10        0       0
## X13945     9197.4307  35    9   34948.9633     1500   5100        0       0
## X2330      6659.7322  46   17   20558.2137     1000    400     6000   40000
## X12080     5664.1469  20   14   21586.1244      700      0        0       0
## X16900     3237.7455  69   17  415275.9171   234400      0        0 1000000
## X1080     11386.7530  31   17   55507.1770      810  45000        0       0
## X19180     2812.2327  53   16   87372.4083        0  80350   236000   20000
## X2925      6746.5369  38   16  149047.0494     2000  15000        0       0
## X7555      7486.2469  62   12   35976.8740     7500      0    42000       0
## X16600     8180.4213  57   17  166521.5311     3000   2700        0       0
## X16795     7270.1531  37   12   65786.2839      400     50        0       0
## X16545     7485.1324  60   12   44200.1595     2000   1500        0       0
## X20245     6984.7124  25   12   56535.0877     2000      0        0       0
## X9180      6503.6143  32   17   87372.4083        0   5500     4000   50000
## X16480     9494.7847  76   14   69897.9266     1500      0   173000  100000
## X17355    11253.9904  57   12   66814.1946     1610    400        0       0
## X5875      7786.5650  80   10    8326.0766     6000   1000        0       0
## X16145     7168.1489  45   14   50367.6236        0   3000        0    2500
## X21770     5079.5219  62    8    7298.1659        0      0        0       0
## X11820     5689.9210  38   16   69897.9266     2501   1200    48000       0
## X9390      7393.9024  65   16   14390.7496     1000   3000     2000    2000
## X12520     6332.1609  67   10   14390.7496     6000      0        0       0
## X12040    11386.7530  76   16   37004.7847     2500  11000        0   38000
## X8890     10431.8465  50   11  114098.0861    15000      0        0    8300
## X150       6767.4428  29   13   47283.8915      100      0        0    7000
## X4600      7132.4826  66    7   34948.9633     1500    800        0       0
## X12490     7344.0563  70   12   13362.8389       40      0        0       0
## X5640      8473.5305  43   12   17474.4817        0   1700        0       0
## X1758518   6373.6917  52   17   57562.9984     1000      0        0       0
## X3105      1720.8497  54   17  444057.4163        0      0  3900000  100000
## X1070019   6501.2709  83   14   58590.9091        0      0   330000  275000
## X7035      8463.5537  49   16   54479.2663        0      0        0       0
## X19950     4181.7756  53   16  243614.8325     8300   3010        0    2300
## X12835     4692.9691  39   12   40088.5167     1500     15        0       0
## X12050     5651.0074  48   12   67842.1053     2500   2800        0       0
## X12605     7292.5295  48   14  116153.9075     2800    410        0       0
## X16605    10121.8582  74   14   67842.1053     6500      0        0       0
## X100       6078.5087  65    9    7914.9123        0      0        0       0
## X21095     6777.8412  51   14   76065.3908      600      0        0       0
## X9640      8069.9291  52   12   35976.8740      400      0        0       0
## X18820     1099.8573  50   17  248754.3860   237500 250000        0    2200
## X20565    11386.7530  40   17  193247.2089     5000   1000        0   23000
## X13035     8220.9304  83   14   81204.9442     7000 193000        0       0
## X15555      307.3822  48   16  176800.6380     5200   1000   650000       0
## X8315      6657.7220  69   16   51395.5343     1000 115000        0   10000
## X990       6273.6366  28   11   69897.9266     3000      0        0       0
## X19305     4792.5614  47   17   93539.8724     5400      0        0   15000
## X14165     8909.1588  54   14   69897.9266     2500      0    28000   62000
## X1785      9471.2816  63   12   21586.1244     2300      0        0       0
## X22045     3394.2432  68   12   80177.0335     2000  15000        0   25000
## X31520     5870.6340  40   12   42144.3381      400      0        0       0
## X21535     8618.8325  66   12    9765.1515      550   5600        0       0
## X8005      7407.0569  45   17   51395.5343     8000      0        0       0
## X21855     5619.8455  28   14   56535.0877     1600   1430        0       0
## X2965      8016.8139  54   16   92511.9617     1100    400    11000   10000
## X19925     8390.7838  38   12   35976.8740     3500  77000    63000       0
## X21305     7544.4668  35    5   10279.1069        0      0        0       0
## X11315    11291.1815  63   17   75037.4801     7200   6200        0   45000
## X2870      6938.1564  37    8   16446.5710        0    200        0       0
## X7845      5128.3639  24    9   37004.7847      200      0        0       0
## X11325     9377.8584  41   13   80177.0335      200   2200        0     300
## X7135      6633.8745  24   14   32893.1419     1700      0      600       0
## X1223521   5988.0417  26   16   61674.6411     3000      0        0       0
## X19955     5867.5632  55   12   43172.2488     7000      0    62000       0
## X12115     4344.4525  36   13   25697.7671        0   2000        0       0
## X20770     4390.4068  26   10   10279.1069        0      0        0       0
## X695       7345.0840  45   12   39060.6061        0      0        0       0
## X11320    10038.8263  41   16  128488.8357     5000      0        0   20000
## X7080      6071.8608  61   12   27753.5885     1300      0   330000   90000
## X21705     6325.1681  42   12   20558.2137        0      0        0       0
## X10855     7418.4265  50   14  102791.0686     5300   1700        0       0
## X6340      7390.7142  46   16   82232.8549     2000    100        0       0
## X17450     7513.2426  57   12   11307.0175        0      0        0       0
## X2895      8886.6097  83   12   13362.8389        1   1000        0       0
## X8115      8155.4622  78   17   98679.4258     7000      0   140000       0
## X430       4603.5133  35   14   88400.3190      810   6000    20000       0
## X1027522   5933.3748  43   16  144935.4067     7700  15300    17000   17000
## X387023    7560.2604  63   12   29809.4099     1500      0        0       0
## X13920     6330.8427  47   12   24669.8565      700      0        0       0
## X3160      4384.6392  46   13   33921.0526      300      0        0       0
## X19125     3276.5295  59   13  154186.6029     2700      0        0   10000
## X21480     1832.4461  60   14  431722.4880   220000      0   990000  156000
## X13180     4989.6261  47   12   32893.1419      750      0        0       0
## X7970      9005.8034  88   10   16446.5710    11000      0        0       0
## X11435     9493.2555  52   15  100735.2472        0   4100        0       0
## X16800     6750.1669  23   13   15418.6603      200      0        0       0
## X5575      6212.7090  86   12   18502.3923      100      0        0       0
## X9880      4017.0656  33   12   29809.4099      700      0        0       0
## X13440     5195.9143  53    8    9765.1515       10      0        0       0
## X17370     5668.8620  46   12   32893.1419      600      0        0       0
## X17200    10178.6915  76   12   26725.6778     1000   5000        0       0
## X3905      6746.5369  41   17  183996.0128    15800  18000        0   60000
## X14000     7172.7607  29   12    4214.4338        0      0        0       0
## X9710      5155.4648  54   12   15418.6603     1600      0        0       0
## X5300      6574.6565  76   12   42144.3381     8500   5200    86000  100000
## X12985     8867.1917  40   12   24669.8565     1200      0        0       0
## X2007524  10164.9687  56   14   37004.7847     1000      0        0       0
## X15575     6836.0441  39    9   32893.1419     2000      0        0       0
## X4245      8696.1124  67   12   20558.2137      400      0        0       0
## X21505     5686.0635  43   12   29809.4099        0      0        0       0
## X4215      7936.4547  22   15    8223.2855        0     30        0       0
## X12535    11794.4073  78   12   14390.7496      700      0        0       0
## X16475     9521.9200  75   10   10279.1069     6800      0        0       0
## X4570      6836.3888  53   13   10279.1069       50      0        0       0
## X15300     8124.9062  45   12   10279.1069        0      0        0       0
## X18200     6704.2828  40   12  121293.4609      810   1400        0   50000
## X2325      7803.2247  45   15  113070.1754     4900   2000    12000    1000
## X3430      4850.2392  61    2   35976.8740        0   1000        0       0
## X7495      6984.7124  32   10   12334.9282      300      0        0       0
## X489525    8641.0258  36   12   30837.3206      400    700        0       0
## X1680026   6750.1669  23   13   15418.6603      200      0        0       0
## X21375     7709.7569  42   14  101763.1579     1000   4201     1400  170000
## X11115     6525.0952  64    9   21586.1244      800      0        0       0
## X5220      5976.1950  26   16   30837.3206     6000      0        0    1000
## X1488027   5711.2392  52   12   94567.7831     2150   2610        0       0
## X21940     6131.7347  46   12  167549.4418     2530      0     7000       0
## X1364528  10246.9474  81   12   44200.1595    51000      0        0       0
## X21040     5845.6749  25   16   25697.7671       50      0        0       0
## X7125      6563.5801  49   13   31865.2313        0    600        0       0
## X8670      5196.5893  21   12    7195.3748      400    110        0       0
## X10640     6599.9746  47   12  136712.1212     1500      0        0       0
## X18375     3907.1193  26   12   29809.4099     1000   2500        0       0
## X20845     7247.0973  25   12   35976.8740     2400    205        0    2500
## X595       6578.6259  51   17  162409.8884     5000  27000        0    3000
## X1455      5851.6456  43   16  103818.9793     1000      0     8000       0
## X8760      4733.7759  81    4    4625.5981        0      0        0       0
## X626029    3036.6357  44   14  177828.5486        0      0    50000   85000
## X8475     11291.1815  53   17  137740.0319        0   2000   350000  150000
## X3085      6093.4654  30   14   54479.2663      300   2000    15000       0
## X9285      3597.7161  67   16  129516.7464     4000  70000    50000  300000
## X6940      7270.8124  49   13   48311.8022     1200    180        0       0
## X557530    6212.7090  86   12   18502.3923      100      0        0       0
## X16710     7779.0121  64   12    6167.4641     4500      0        0       0
## X15515     5222.0284  20   12   14390.7496       20      0        0       0
## X3530      4372.7256  48   16  102791.0686     3200      0    24000       0
## X6860      5487.6011  43   12   52423.4450        0      0        0       0
## X14630     4457.1985  34   12   31865.2313     1100     80        0       0
## X14705     9293.5925  76   14   28781.4992    27000      0        0       0
## X13010     4911.3897  30   12   70925.8373     8800   5000        0       0
## X1792031   6764.6424  57   12   25697.7671     1600    900        0       0
## X12705     5259.2512  65   11   25697.7671     2000   5000        0       0
## X9870      5335.8581  30   12   44200.1595      500      0        0       0
## X17305     6237.3654  32   15   55507.1770     1150   1500        0       0
## X21595     7445.0778  82   12   20558.2137        0   2000        0    2000
## X13725     7519.7320  55   11   27753.5885        0    700        0    4000
## X10040     7735.4996  88   13   11307.0175        0      0        0       0
## X7005      5575.5340  52   10   41116.4274        0   1000        0       0
## X3760      5696.2367  41   12   29809.4099     5000    120        0       0
## X14910     5933.3748  43   16  308373.2057     4000 100000   202000       0
## X13365     9269.3339  55    9    6270.2552        0     10        0       0
## X11410     5960.0579  33   16  119237.6396     2400      0        0       0
## X12220     2814.9047  36   16  855221.6906    60000 230000   350000       0
## X18420     7029.1679  31   13   80177.0335     1550      0        0    3000
## X9005      7842.3749  40   15   21586.1244        0      0        0       0
## X11855     7333.3380  38   12   12334.9282      200      0        0    7000
## X21405     3960.9948  55   11   30837.3206        0      0        0       0
## X21260     9441.6518  64   12   28781.4992      600      0        0       0
## X8020      4444.7730  55   17    8634.4498        0    500        0       0
## X10370     7938.8328  34   14   49339.7129      500    550        0       0
## X15255     7597.5728  73    3   23641.9458     3000      0        0       0
## X12735     5620.7655  43   12   72981.6587     1500      0        0       0
## X2635      8359.0939  58   17   57562.9984     2000   6000        0       0
## X4765      6452.8207  37   14   13362.8389      140      0        0       0
## X20295     6316.2726  38   16   82232.8549      500    200     3000   12000
## X4030      5316.4354  44   12    1233.4928        0      0        0       0
## X21360     9418.0637  31   11   69897.9266        0   2330        0       0
## X858032    7333.3380  40   12   39060.6061     1000      0        0       0
## X15240    11386.7530  30   17  125405.1037     2800    800   100000   15000
## X1052033   4336.5281  25   14   26725.6778      800   1500        0       0
## X8230      2913.2891  66   16  237447.3684   105000      0        0  375000
## X6565      6185.1676  45   17   51395.5343      590   5500    22000       0
## X8210      8582.8214  69    8   21586.1244        0   4000        0       0
## X16370     7369.8167  55   12   12334.9282     2000      0        0       0
## X2495      5634.0684  68   13   46255.9809     4500  15000        0       0
## X4950      9383.2642  78   12   21586.1244     3000    700        0       0
## X20625     3005.5195  67   17  169605.2632     5000      0        0       0
## X12640     6976.6009  38   17   85316.5869     3500   5800        0       0
## X16455     9143.1550  48   16  102791.0686     2000  35000        0   30000
## X20670     7144.1292  36   13   25697.7671        0      0        0       0
## X9855      5315.0495  28   11    2569.7767        0      0        0       0
## X7590      6391.5128  50   12   29809.4099        0      0        0       0
## X10390     8251.3463  82    1    6887.0016      160      0        0       0
## X6885      4867.2383  54   17  239503.1898    12700      0        0       0
## X12630     4832.6968  87   10   35976.8740     1000  95000   190000       0
## X587534    7786.5650  80   10    8326.0766     6000   1000        0       0
## X6415      5711.2392  45   16   97651.5152     1050  10000        0       0
## X13800     5809.2268  72   14   49339.7129     4500      0   184000       0
## X21210     5990.8378  39   12   64758.3732     1000   2500        0       0
## X20775     5111.3136  19   14    6887.0016      700      0        0       0
## X16165     6421.3632  33   16   65786.2839     3500      0        0    1200
## X249535    5634.0684  68   13   46255.9809     4500  15000        0       0
## X18530     5463.0804  57   17   76065.3908     2500  36000        0       0
## X1182036   5689.9210  38   16   69897.9266     2501   1200    48000       0
## X2485      2708.6805  49   17   61674.6411      400   3000        0   65000
## X16785    10609.3570  74   12   15418.6603     1900      0        0       0
## X11750     6777.0313  67   17  104846.8899    10000  22000     3000       0
## X3025      6021.0394  36   14   41116.4274      800      0        0    3500
## X3470      6106.0776  50   12   59618.8198     3000  13900        0       0
## X1860      7640.0210  43   14   97651.5152     4200   1000        0       0
## X3920      5952.7513  28   15   10279.1069      800      0        0       0
## X19430     8253.7779  46   12   19530.3030      500      0        0       0
## X16535     9276.8733  39   13   26725.6778      200      0        0     200
## X13620     7102.5441  50   13   32893.1419     1500   5000        0       0
## X17880     7448.2342  50   14   67842.1053     1700      0        0       0
## X4875      8104.8327  47   15   82232.8549     2000      0        0   50000
## X19300     8951.3784  49   12    7195.3748        0   1500        0       0
## X7075      4214.2722  27   17   32893.1419      400      0        0       0
## X15130     1258.0767  58   17  145963.3174    10000 100000  1350000  500000
## X13555     7031.0802  53   17  204554.2265    35000      0        0       0
## X8385      5165.9872  42   12   98679.4258      200    200        0       0
## X831537    6657.7220  69   16   51395.5343     1000 115000        0   10000
## X1330      6038.9240  21   12   55507.1770      200    550        0    2300
## X6710      6046.9947  44   14   77093.3014     2000   2000        0       0
## X6055      3727.8709  55   16  236419.4577        0      0        0       0
## X20455     2617.2982  46   12  263145.1356     8000  11000   405000   60000
## X2025      1782.7152  44   16  211749.6013     7000   1500   200000   70000
## X8485      8332.8960  69   12   24669.8565        0      0        0       0
## X6475      6829.6252  30   14   42144.3381     1000      0        0       0
## X4305      3570.0930  59   14   74009.5694    14000      0   296000       0
## X6900      9220.5450  71   16   51395.5343    15000  14000    95000  153000
## X14525     3787.5754  39   17  290898.7241     9000    900        0  300000
## X3070      5948.8229  51   16   34948.9633     1000   2620        0       0
## X811538    8155.4622  78   17   98679.4258     7000      0   140000       0
## X1420      5491.3648  58   16   58590.9091     7000  10000        0       0
## X542039    7143.7855  43   15   58590.9091       50   1960     1500       0
## X18380     3800.5231  81    9  101763.1579    36100   1700   135000  100000
## X4185      9295.7487  68   16   34948.9633      630      0        0       0
## X13830     6275.9318  41   14    5139.5534        0     20    10000       0
## X6590      8483.7784  51   12   88400.3190      480   3050        0       0
## X13340     4841.2957  41    8   32893.1419     1500      0        0       0
## X5625       673.0648  35   17  170633.1738     6000      0     7000       0
## X9625      8567.4103  58   14   93539.8724     4100   6000        0  150000
## X12020     7758.8189  75    9   17474.4817     3000      0        0       0
## X9580      3597.7161  53   16  170633.1738    10000     30   500000       0
## X277040    7022.5454  62   16   47283.8915     2700   2000        0       0
## X50        4662.1601  32   14   79149.1228        0   2000        0       0
## X369541   10109.2136  49   17   76065.3908     2000    500        0       0
## X7540      5516.1131  45   13   22614.0351      300      0        0       0
## X1030     10556.2567  77   12   29809.4099     1000   1000        0       0
## X14400     8454.0719  72   10    4008.8517        0      0        0       0
## X7415      3639.7242  41   14   29809.4099        0      0        0    2500
## X3990      6367.9364  65    5   12334.9282        0      0        0       0
## X3245      6936.3230  63   12   53451.3557     3300      0        0   60000
## X2575      5576.2209  67   17   65786.2839     3000    200     4000       0
## X9105      7529.2122  60   12   20558.2137     1500      0        0       0
## X7985      5978.3895  26   14   89428.2297     2000      0        0       0
## X1300      8554.5639  79   14   31865.2313      810   5100        0   17000
## X4760      4867.2383  39   17   78121.2121      400      0        0  116000
## X16305     6168.6436  22   12   33921.0526       60      0        0       0
## X21035     6648.1001  48   14   45228.0702        0   7500    30000       0
## X2905      5929.3158  75   13    9251.1962     3000  30000        0       0
## X1610      8138.5659  72   12   25697.7671    11000   8500        0       0
## X3490      5496.9282  56   12   15418.6603     4000    500        0       0
## X16585     9117.2509  21   12   28781.4992       50      0        0       0
## X4145      9942.5215  66   10   25697.7671        0      0        0       0
## X3135      7973.0718  35   12   61674.6411        0      0        0       0
## X6000      5336.4730  25   16   20558.2137     1500   3000    20000       0
## X10420     5478.6751  39   12   51395.5343        0      0        0       0
## X1655      5586.1606  58   14   34948.9633        0      0        0       0
## X10705     7409.4581  48    7   12334.9282     4000      0        0       0
## X11735     6187.6895  49   16  122321.3716    20000      0    71000       0
## X6720      5215.7507  22   12   18502.3923     1200    500        0       0
## X12680     8925.5408  27   12   37004.7847     2150    500        0    1000
## X7530     10000.0270  66   12   25697.7671     1200   6000        0       0
## X7795      6111.3967  41   13   40088.5167      120      1        0       0
## X1480      5140.6393  49   16   29809.4099     1800  21600        0       0
## X21575     6161.6719  43    2   68870.0159     1740  10500        0       0
## X2585      9636.8011  58   10   83260.7655     8000      0        0       0
## X16595     7065.6086  52    6    5550.7177        0      0        0       0
## X4040      7888.3857  68    7   14390.7496      100      0        0       0
## X1630542   6168.6436  22   12   33921.0526       60      0        0       0
## X16330     6551.0792  25   16   71953.7480     5000   2740        0       0
## X15665     8384.9108  36   17   12334.9282     2500  26000        0   21000
## X690043    9220.5450  71   16   51395.5343    15000  14000    95000  153000
## X6795      8112.0825  43   16  121293.4609      300    100        0       0
## X21350     9864.7351  68   12   24669.8565     1200    200        0       0
## X6700      1503.1836  66   16  289870.8134        0      0   275000   88000
## X18665     6813.5822  45   14   53451.3557        0   5000        0       0
## X19580     8548.9432  29   14   98679.4258    32800  20000        0   20000
## X20130     8003.3654  32   14   45228.0702      500   1700    10000    2500
## X10325     4593.4824  25   15   23641.9458     2300   3420        0       0
## X4130      3284.9326  35   16  122321.3716     6000   5000   110000       0
## X5475      6092.8720  57    3    6373.0463     3290   9200        0       0
## X1790      7500.1937  66   12   67842.1053     1300      0        0       0
## X17480     8048.4219  41   16   60646.7305     1000   3850        0       0
## X12830     5932.7371  47   12   55507.1770     1100  41000        0       0
## X1865      6282.7566  51    9   62702.5518     3500      0        0       0
## X768044    7263.7921  52   12   35976.8740     1700   3000     2000       0
## X457045    6836.3888  53   13   10279.1069       50      0        0       0
## X11490     8507.8199  88   12    5756.2998     2620  12400        0    5000
## X1146546   5134.4672  54   12   68870.0159      400    900        0       0
## X10180     5948.7354  23   16   20558.2137     2000    100        0    1000
## X3910      5973.1357  35   15   44200.1595     3000  55000   550000       0
## X11565    11386.7530  49   16  100735.2472     2500    450        0    4000
## X21825     6245.2759  24   14   30837.3206     1500      0        0       0
## X4525      4297.7367  42   16   17474.4817     1000      0        0       0
## X3060      8414.8911  76   12   14390.7496      900      0        0       0
## X9250      7609.5092  61   10   25697.7671     1000   1100        0       0
## X17500     5861.6929  56   12   53451.3557    21000  20800        0       0
## X1100      5302.7948  40   12   35976.8740        0      0        0       0
## X16025     7138.9429  42   12   56535.0877     1200   1251        0       0
## X12380     5632.2290  40   16   66814.1946     1000   1500        0       0
## X753547    5532.8460  23   14   25697.7671     1200      1        0       0
## X12850     4689.7790  33   16   61674.6411     3500   5600        0       0
## X1955548   8909.1588  47   14   25697.7671    17320    730        0       0
## X19775     6366.6587  81   12    9353.9872      750   2000        0       0
## X11525     4683.3579  52   10    8017.7033      390      0        0       0
## X2975      9116.0082  72   12   24669.8565      100    500        0       0
## X18895    10098.3165  23   12   18502.3923     2600      0        0       0
## X1602549   7138.9429  42   12   56535.0877     1200   1251        0       0
## X345       6272.7234  44    9   12334.9282     9200      0        0       0
## X490       6976.6009  44   16  116153.9075     3000      0        0       0
## X14580     1435.8448  50   16  488257.5758    10000      0 10000000 8000000
## X10875     7706.5913  41   17   63730.4625     2000  11500        0       0
## X5270      5113.1426  32   16   21586.1244       40      0        0       0
## X9400      8294.9122  57   14   24669.8565        0    230        0       0
## X12900     8693.4985  65   15   27753.5885     7200   3200        0       0
## X4530      8281.4292  79   12   14390.7496        0  35000        0       0
## X17670     6587.6319  42    8   30837.3206        0      0        0       0
## X5440      7096.7500  34   14   54479.2663      300     30        0       0
## X8875      4908.0300  36   12   62702.5518     3000   1030        0       0
## X2060     10858.1982  85   11   22614.0351      340   3500        0       0
## X2153550   8618.8325  66   12    9765.1515      550   5600        0       0
## X5080      6983.5455  49   12   51395.5343     1500      0        0       0
## X12500     7868.5326  45   16   68870.0159     1000  10000      300   22000
## X830       5696.6397  28   14    8634.4498      100      0        0       0
## X495051    9383.2642  78   12   21586.1244     3000    700        0       0
## X1304052   5263.6488  40   13   66814.1946        0    380        0       0
## X16685     6956.0287  24   12   30837.3206      300    200        0       0
## X5695      6842.0098  82   16   17474.4817     1900      0        0       0
## X2026553   4733.4575  40   16   28781.4992      820    400        0   18000
## X215       6221.1912  53   17   74009.5694    11000      0        0   25000
## X7460      9759.8671  45   12   43172.2488     3200    600     5000       0
## X21060     7675.9421  92   16    9251.1962     2500  18000        0       0
## X3770      6271.7559  61   12   56535.0877    20000   7000    35000       0
## X940054    8294.9122  57   14   24669.8565        0    230        0       0
## X15320     7356.9559  61   14   11307.0175      600      0        0       0
## X96555     5837.2792  83   12   14390.7496      600  20000        0       0
## X19340     9125.7408  78    8   18502.3923        0      0        0       0
## X1395      6579.0643  47   16   35976.8740     1300    500        0       0
## X939056    7393.9024  65   16   14390.7496     1000   3000     2000    2000
## X5245      8229.2596  61    6   11307.0175      430      0        0       0
## X18830     9253.5661  45   14   22614.0351      450    600        0       0
## X15215     6956.5608  39   16   66814.1946     2000    500    15000       0
## X496557    5025.3417  49   10       0.0000        0      0        0       0
## X12210     7636.2540  35   12   24669.8565        0  22000        0       0
## X17560     9704.9669  66   12   66814.1946     2500   5000        0       0
## X19625     8823.5948  63    1   17474.4817     4000  14700        0       0
## X2530      7480.5053  41   12   20558.2137     1000      0    46000     500
## X9075      6474.0924  41   12   66814.1946     1000   1000        0       0
## X1925      8462.7026  88   11   11307.0175        0   9900        0       0
## X21010     7610.4252  28   16   59618.8198        0  23000    50000       0
## X1745058   7513.2426  57   12   11307.0175        0      0        0       0
## X17555     2695.1985  39   13   42144.3381      700      0        0       0
## X2018059   8410.7240  61   15  101763.1579      450      0        0       0
## X5330      2017.1634  77   16  117181.8182     7000      0        0  100000
## X2150560   5686.0635  43   12   29809.4099        0      0        0       0
## X2970      4411.1662  56   12   13362.8389        0   2100        0       0
## X19190     9186.7808  35   17   68870.0159      200   1700        0       0
## X12570     8712.1533  39   17   23641.9458     1200      0        0       0
## X1325      6437.8045  66   12   32893.1419      900      0      200       0
## X4195      6522.8860  33   14   81204.9442     5000   3000        0       0
## X20915     2494.8521  45   12   61674.6411     3000      0        0       0
## X14145     6934.1103  46   12   77093.3014    87800   8400    19600    1000
## X13090     5132.5279  52   16   33921.0526      300      0        0       0
## X2211061   8414.4992  38   12   56535.0877     1500   1750        0       0
## X13062     6832.8261  50   17   64758.3732     1100  14700        0       0
## X7190      5867.3346  59   12    4522.8070        0     10        0       0
## X10690     8483.6259  79    9   12334.9282      200      0        0       0
## X21495     5951.9803  29   11   25697.7671     2000      0        0       0
## X3745      5958.4027  47   13   87372.4083     1000    370        0       0
## X2315      6840.0838  22   12   18502.3923        0     10        0       0
## X3170      6539.6647  51    3   10279.1069      160      0        0       0
## X10940     9740.7156  76   14   46255.9809     3200   1700   103000   20000
## X2116563   5386.4622  40   14   71953.7480        0 110000   135000       0
## X2109564   6777.8412  51   14   76065.3908      600      0        0       0
## X233065    6659.7322  46   17   20558.2137     1000    400     6000   40000
## X17530     5609.5420  38   16   31865.2313     1000      0    34000       0
## X12410     1832.4461  66   16  430694.5773     4500      0  1220000       0
## X1694566   6440.1730  79   12   39060.6061    27000  15000        0       0
## X1250      7269.0062  41   12   52423.4450     4270   6010        0       0
## X1323567   5454.0787  29   14    9251.1962     2000  20000        0       0
## X17190    11386.7530  42   12  207637.9585     2100   1100     6200       0
## X103068   10556.2567  77   12   29809.4099     1000   1000        0       0
## X1018069   5948.7354  23   16   20558.2137     2000    100        0    1000
## X18295    10394.6521  53   11   16446.5710      100    200        0       0
## X8770     10597.7984  76   16   18502.3923    11300      0        0       0
## X585       8414.4992  38   12   82232.8549     2100   3000        0       0
## X8750      3316.2325  35   17  105874.8006        0      0        0       0
## X13955     5457.0780  63    9   16446.5710      100   7800        0       0
## X18825      216.1459  60   16  256977.6714     1000      0   500000 5000000
## X12280     7685.7706  42   16   75037.4801        0      0        0       0
## X21780     5711.2392  53   14   56535.0877     1000      0        0       0
## X17810     6746.5369  62   17  105874.8006     4000  14000        0   15000
## X2535      5622.6830  62   16   64758.3732     1500  12000        0       0
## X1614070   6677.0208  68   14   44200.1595    22000  50000        0       0
## X17290     6934.7265  60   14   22614.0351     1500   3000        0       0
## X15400     1832.4461  59   12  325847.6874     1500  56000    17000   18000
## X1842071   7029.1679  31   13   80177.0335     1550      0        0    3000
## X1650      7512.8658  52   13   80177.0335     1000      0        0       0
## X2050072   8349.2691  26   10   19530.3030        0      0        0       0
## X360       9617.6686  29   11   17474.4817        0    100        0       0
## X1895      9047.5141  41   17  118209.7289     2550   2200    35000   95000
## X7350      5915.7400  40   12   89428.2297       20      0        0       0
## X107573    7485.5250  77   12   12334.9282     1700  12390    43000       0
## X2204574   3394.2432  68   12   80177.0335     2000  15000        0   25000
## X10345     6435.6192  41    9   18502.3923       20      0        0       0
## X90575     7456.1503  23   11   12334.9282      700      0        0       0
## X75        5826.0300  41   16  135684.2105     3000   2000        0    3000
## X1550      8535.9389  32   14   72981.6587     1050    310        0       0
## X6040      6911.1151  50   12   27753.5885      500      0        0       0
## X3525     11291.1815  75   17  104846.8899     5000      0        0       0
## X6980      6885.7776  65   12   69897.9266      590   8450        0     900
## X178576    9471.2816  63   12   21586.1244     2300      0        0       0
## X16935     6093.3713  22   12   24669.8565     8000      0        0       0
## X928577    3597.7161  67   16  129516.7464     4000  70000    50000  300000
## X9615      6226.2105  42   15   19530.3030        0      0        0       0
## X17940     7647.2525  21   12   33921.0526      280    500        0       0
## X2855      5717.5206  66   12    9148.4051        0      0        0       0
## X17900     8173.5225  56    6   27753.5885      590      0        0       0
## X1222078   2814.9047  36   16  855221.6906    60000 230000   350000       0
## X2020     11188.2330  75   10   12334.9282     1050   4000        0       0
## X10380     6301.9789  24   15    3083.7321        0      0        0       0
## X4000      9404.8200  56   12   32893.1419     2000      0    19200       0
## X90        7762.3371  63   12   49339.7129      500   3000        0     400
## X19145     6279.5012  46   17   88400.3190     1000   1500    69600       0
## X9140      4396.2365  59   15       0.0000     1600      0        0       0
## X2300      5313.0456  29   11   15418.6603       30      5        0       0
## X13560     4701.5738  50   12   18502.3923      600    130        0       0
## X1767079   6587.6319  42    8   30837.3206        0      0        0       0
## X9665      8183.9394  41   17  141851.6746      230    590      200   14000
## X9065      6428.1923  62   14   28781.4992      500   4500        0       0
## X12715     5000.2165  52   12   87372.4083      380      0        0       0
## X1069080   8483.6259  79    9   12334.9282      200      0        0       0
## X15150     5438.2853  42   15   41116.4274        0  48000        0       0
## X14780     3597.7161  71   16  121293.4609    27000  12000   900000 1000000
## X3080      5431.6661  46   16  131572.5678     1410   3340        0       0
## X1023081   7426.1415  49   14   30837.3206        0  10000        0       0
## X9725      8641.0258  42   12  107930.6220     2500   3700        0       0
## X1330082   5711.5575  59   12   16446.5710      100     70        0       0
## X3215     11386.7530  41   16  136712.1212     4000 103000    18000   70000
## X1069083   8483.6259  79    9   12334.9282      200      0        0       0
## X19635     4799.3712  49    9   10279.1069        0      0        0       0
## X14800     6505.3275  35   12   82232.8549     3000      0        0       0
## X2105584   8250.0749  67   12   12334.9282    10220      0        0       0
## X21380     4683.3579  51    6   11307.0175        0      0        0       0
## X2024585   6984.7124  25   12   56535.0877     2000      0        0       0
## X11040     6972.1301  33   16   34948.9633      300   1050        0     800
## X12070     8012.1578  73    8   18502.3923      920      0        0       0
## X3465      9199.2632  30   16  141851.6746     6000    510    13000   23000
## X20725     9942.5215  90   10    6167.4641        0      0        0       0
## X15730     8244.9929  29   12  113070.1754      100  30000        0       0
## X17005     1503.1836  73   17  154186.6029     2500   3000   315000   18000
## X4065      6197.2818  59   13   34948.9633     2200      0        0       0
## X620       6750.9058  56   12   18502.3923      300      0        0       0
## X1776586   6235.1707  83   15    7812.1212     2000      0        0       0
## X2715      8028.6114  32   12   57562.9984        0      0        0    1000
## X5210      9050.9752  48   14   46255.9809      800   1000     5000       0
## X1303587   8220.9304  83   14   81204.9442     7000 193000        0       0
## X18625     6316.2726  43   14   49339.7129      700   3310    30000       0
## X20460     4841.8475  25   12   28781.4992        0    300        0       0
## X4700      6088.9010  49   16   30837.3206     2000      0        0       0
## X256588    6641.8552  42   16   88400.3190     2000  44000   100000   20000
## X4180       146.7205  49   16  678421.0526        0  10000  1000000       0
## X5760      7480.3327  51   16  178856.4593     3750      0        0       0
## X286589   10911.3427  65   11   26725.6778     7000   6000     7500   22000
## X5745      5771.2818  34   12   18502.3923     2700      0        0       0
## X5175      5026.4489  36   11   19530.3030        0    300        0       0
## X15105     6017.3025  48    8   26725.6778       10   4000        0       0
## X19895     5145.1112  58   12    5139.5534      310      0        0       0
## X1210      6281.8256  51   16  140823.7640    11200  28000        0       0
## X1998090   7630.0979  86   17   37004.7847    10000  20000        0       0
## X202091   11188.2330  75   10   12334.9282     1050   4000        0       0
## X14570     6568.5137  33    9   42144.3381     1000   1700     9000       0
## X2100      9341.9182  47   12   35976.8740      480      0        0       0
## X1208092   5664.1469  20   14   21586.1244      700      0        0       0
## X21340     6229.3682  22   14   22614.0351     1000   2000        0   25000
## X14250     6519.0232  30    7   24669.8565        0      0        0       0
## X1719093  11386.7530  42   12  207637.9585     2100   1100     6200       0
## X2105594   8250.0749  67   12   12334.9282    10220      0        0       0
## X11425     6084.3695  54   17  160354.0670     7230   1500    39000   20000
## X2135095   9864.7351  68   12   24669.8565     1200    200        0       0
## X13565     6017.3025  50   15   35976.8740       50      0        0       0
## X17540     3468.5962  42   17  164465.7097        0   8100    84000  100000
## X2985      8494.8290  77    7   15418.6603        0      0        0       0
## X15070     4397.4675  30   16   15418.6603     2200    500     4800       0
## X3505      9857.4584  70   13   19530.3030     1800  11000        0       0
## X15015     9887.6811  55   14   66814.1946     1530      0        0       0
## X16815     4348.7029  32   12   13362.8389      500      0        0       0
## X7485      6228.4329  45   16   41116.4274        0      0        0       0
## X18460     5665.4249  64   16   65786.2839        0      0    65000       0
## X9465      5717.0688  42   13   54479.2663        0      0        0       0
## X10825     6128.8931  32    9   15418.6603       50   2500        0       0
## X8105      7108.2801  34   16   71953.7480     1150  11500        0  170000
## X5820      6043.2868  54   13   38032.6954     1000  22000        0       0
## X14765    11386.7530  71   12   38032.6954     4500   3000        0  160000
## X5340      2818.4551  54   17  175772.7273     6000  35010   140000    7000
## X3720      4804.4293  34    9   22614.0351        0      0        0       0
## X4475      6685.2809  59   17   25697.7671      350      0    35000       0
## X15185     9717.4090  36   12   30837.3206      520   1900     6000       0
## X68096     6833.6584  48   16  100735.2472     3800      0    13000       0
## X13745     7513.1999  39   11   11307.0175        0      0        0       0
## X125097    7269.0062  41   12   52423.4450     4270   6010        0       0
## X8345      6991.5808  33   11   53451.3557      400   1200        0       0
## X2435      6752.1781  57   13   17474.4817      280      5        0       0
## X1900      3591.7791  77   17   38032.6954     6000  15000   175000  250000
## X11670     5731.3659  58   16   18502.3923     5700   2500    24000       0
## X18465     8998.8304  61   12   47283.8915     5200   4000        0       0
## X20605     9663.4260  39   12   27753.5885     1000   1500        0       0
## X1127598  10483.6685  68   12   27753.5885     3300   8000    32000  116000
## X15815     7402.3350  41   12   49339.7129     1500    100        0       0
## X4465     10051.0999  69   15   61674.6411     4500      0    28000       0
## X14585     6609.2130  49    2  134656.2998      500      0    30000    1000
## X12930     6801.3996  40   14   43172.2488     2500   1000    20000       0
## X3875      6076.2830  33   14   41116.4274     1000   6000        0       0
## X11340     5351.9437  79   14   41116.4274     2300  15000        0   50000
## X4985      7666.4941  56   12  113070.1754      880      0        0   50000
## X21245     7308.9786  33   12   29809.4099        1      0        0       0
## X1203599   5803.8741  35   12   46255.9809      770      0        0       0
## X18640     9424.8965  70   16   21586.1244      150   3000        0       0
## X3875100   6076.2830  33   14   41116.4274     1000   6000        0       0
## X7570      7388.8300  37   12   48311.8022     2500    150        0       0
## X16115     7197.0427  46   16  102791.0686        0   5000        0  200000
## X6355      6830.9965  46   12   29809.4099        0      0        0       0
## X17615     7230.1995  48   12   30837.3206       30    400        0       0
## X6920      4859.1365  80    3    5550.7177        0      0        0       0
## X8960      4327.1130  32   16   83260.7655     3540    330        0       0
## X2325101   7803.2247  45   15  113070.1754     4900   2000    12000    1000
## X255       8959.1771  57   14   77093.3014     2000  10000        0       0
## X19515     4547.3081  28   16   82232.8549     1110      0    20000   25000
## X19205102  7691.5051  44   12   53451.3557     1000   1050        0       0
## X15825     6525.3098  38   16   58590.9091     1000      0        0       0
## X9090      5431.0831  47   14   39060.6061      500    580        0       0
## X4540     10335.8356  83    8    7709.3301      700   8000        0       0
## X15225     1483.1859  46   16  575629.9840    13460  59970   330000  600000
## X10300     4752.0953  24   12    7400.9569        0    120        0       0
## X21650     2980.8019  36   14   41116.4274     2000    200        0       0
## X3780      4750.7510  55   12   10279.1069       10      0        0       0
## X13835     7147.1438  54   15   57562.9984     2500      0        0       0
## X5100      7138.9429  41   13   51395.5343     2000      0        0       0
## X11025     5317.0913  43    2   65786.2839     3000   8700        0       0
## X13740     5573.7811  36   12   32893.1419     2700   6000        0       0
## X17905     8897.0175  33   16   32893.1419      900     60        0       0
## X925       4893.8646  54   13   41116.4274     2000    500        0       0
## X11925     9050.7760  42   17   12334.9282      510      0        0       0
## X5210103   9050.9752  48   14   46255.9809      800   1000     5000       0
## X14005     4375.7084  24   12   11307.0175        0      0        0       0
## X17815     6270.3754  56   16     102.7911     1000      0        0       0
## X14200     6430.0038  43   12   57562.9984      610   2890        0       0
## X2855104   5717.5206  66   12    9148.4051        0      0        0       0
## X310105    5950.2488  40   12   21586.1244     2000    500        0       0
## X15355    10936.6997  71   17   26725.6778     4000      0     2000       0
## X15135     7500.0875  30   13   43172.2488     6500   3800        0       0
## X9020      8109.0436  30   16   83260.7655    11000      0        0       0
## X18630     9692.4653  42   10   82232.8549     2520    800        0       0
## X17315     9117.2509  23   12   15418.6603        0    105        0       0
## X19685     5755.9082  49   12   53451.3557     1000    300        0       0
## X7100      5700.7692  42   12   30837.3206      400    220        0       0
## X12945106  6490.4551  27   12   28781.4992        0      0     4000       0
## X7655      1537.8345  39   16  411164.2743    18000      0    50000   50000
## X20875     4540.8913  31   12    6681.4195        0   1650        0       0
## X9300      7690.7114  32   13   52423.4450      210    700        0       0
## X16905107 11386.7530  76   16   28781.4992     5600   6800    48000  100000
## X155       6972.1301  31   12   46255.9809     4300    300        0       0
## X15280     4540.8913  32   13   14390.7496        0      0        0       0
## X17385     5748.4001  85   14    9045.6140     1200      0        0       0
## X12880     9610.2593  29   16   69897.9266     2400    250        0       0
## X1595      8089.9766  43   17  175772.7273     2990   1800   103000    2500
## X5720      7359.6699  30   12   51395.5343      200   3000        0       0
## X17375     5950.0337  58   14  100735.2472      500  18000        0       0
## X11495108  9240.9040  44   16   81204.9442     5700   2200        0    8800
## X7680109   7263.7921  52   12   35976.8740     1700   3000     2000       0
## X2590      6645.0764  48   15   28781.4992     1820    850        0       0
## X7200      9629.4836  72   12   10279.1069     1000      0        0       0
## X1575      6805.1027  38   10   61674.6411      850      0        0       0
## X12065     4620.4798  37   12    6887.0016        0      0        0       0
## X9715      8414.4992  39   17  123349.2823     2500   2000        0       0
## X5065      5298.4252  62   10   11307.0175        0      0        0       0
## X9520      4900.7021  39   13   30837.3206     1000   2300        0       0
## X20565110 11386.7530  40   17  193247.2089     5000   1000        0   23000
## X2100111   9341.9182  47   12   35976.8740      480      0        0       0
## X5175112   5026.4489  36   11   19530.3030        0    300        0       0
## X15880    10431.8465  53   12   40088.5167     1300   4500        0       0
## X615       6944.9344  22   12   18502.3923     1000      0        0       0
## X19490     6198.3243  47   17  144935.4067     2500   8000        0    7500
## X13850     5611.6356  32   13   38032.6954      400      0        0       0
## X14070     7228.2467  74   12   33921.0526      520   3900        0       0
## X16555113  7602.8631  71   12   16446.5710     1000      0        0       0
## X21750     6950.8753  52   17   39060.6061      300   1200        0       0
## X1305      7662.7554  35   13  123349.2823     1600  15500        0   35000
## X5210114   9050.9752  48   14   46255.9809      800   1000     5000       0
## X14400115  8454.0719  72   10    4008.8517        0      0        0       0
## X4120      3435.4000  83   16   30837.3206    10000      0        0       0
## X13600     9383.0100  32   14   92511.9617        0   1300        0   55000
## X1670     11259.5753  66   12   18502.3923      800      0        0       0
## X8790      7029.6621  44   16  100735.2472      400   1900        0    1300
## X8150      6310.8060  47   12   53451.3557      750   5000        0       0
## X14155     6695.2174  73    4   33921.0526    11500      0        0       0
## X17905116  8897.0175  33   16   32893.1419      900     60        0       0
## X16735     5804.3179  66   16   51395.5343     1000      0        0       0
## X21095117  6777.8412  51   14   76065.3908      600      0        0       0
## X10280     9401.2080  80   16   42144.3381     9200      0        0       0
## X8695      6033.5784  22   12   29809.4099        0      0        0       0
## X15485     5972.4428  50   14    4933.9713        0      0        0       0
## X920118    5812.9110  44   15   87372.4083      300  32700        0       0
## X525       8516.9613  73   12    9559.5694     1300      0        0       0
## X10740119  9507.0043  78    5    7709.3301        0      0        0       0
## X8885      5016.1313  47   12   41116.4274      240    250        0       0
## X20200     5845.6749  26   14   52423.4450     1000    300        0       0
## X2295      7105.8186  69    6   17474.4817      720      0        0       0
## X14855     4017.0656  34   12   27753.5885      800      0        0       0
## X20390    10431.8465  50   13   48311.8022     3000      0        0       0
## X13895     5545.0508  35   12   37004.7847     1020      0      770       0
## X12335     8147.8136  76   12   30837.3206    18000  11000        0       0
## X11880     5934.1263  54   12   63730.4625        0      0        0       0
## X3750      1541.4366  65   15   65786.2839     3000   2000        0    2500
## X16305120  6168.6436  22   12   33921.0526       60      0        0       0
## X11875     6985.4172  42    9   45228.0702        0    100        0       0
## X7670121   7111.7751  50   14   53451.3557     3100      0        0       0
## X6130      7499.2157  41   12   11307.0175        0      0        0       0
## X8050      7657.5067  56    7   47283.8915      150      0        0       0
## X17970     8302.0167  72   16   15418.6603        0      0        0       0
## X18735     4949.8349  49   16   10279.1069        0     50        0       0
## X6920122   4859.1365  80    3    5550.7177        0      0        0       0
## X235       4923.1401  48   12   42144.3381        0      0        0       0
## X5530      6282.3042  52   17  118209.7289        0      0    10000   10000
## X18720     5887.3046  49   12    2055.8214        0    400        0       0
## X2330123   6659.7322  46   17   20558.2137     1000    400     6000   40000
## X335       4177.0529  41   12   25697.7671        0   1500        0       0
## X19405     5895.3995  29   12   69897.9266     2830    910        0       0
## X2615      8241.5555  46   12   60646.7305     2400   1500        0       0
## X8060      6654.5552  67   12   61674.6411     2200  15000        0  100000
## X7990      4922.4516  27   14   24669.8565      630      0        0       0
## X17205     8028.6114  34    5   37004.7847     1400    200        0       0
## X110      10373.1531  79   12   16446.5710      800      0        0       0
## X16470124  3597.7161  43   12 1408237.6396       10      0        0       0
## X8690      7624.4548  48   12   82232.8549     4000   7000        0       0
## X3350     11386.7530  58   13   30837.3206    11000 159900        0    2500
## X2440      5845.6749  30   15   46255.9809     1020      0        0    5700
## X9570      6262.5384  20   14   24669.8565      100      0        0       0
## X19650     7270.8124  48   13   29809.4099      650   7000    40000   25000
## X12680125  8925.5408  27   12   37004.7847     2150    500        0    1000
## X6175      7664.6497  52   11   68870.0159     1800      0        0    1200
## X2860126   6604.7905  45   12   20558.2137     1500    530        0       0
## X21470     9663.4260  42   12   25697.7671      600      0        0       0
## X9360      5700.4279  22   13   82232.8549      300   5500        0       0
## X3235127   6509.0382  57   12   51395.5343      700   2200        0       0
## X10540     3529.8025  52   17  102791.0686    11000      0   464000  120000
## X18595     1503.9178  49   17  116153.9075     5000  66500   115000   19000
## X13935     5057.4235  86   17   72981.6587    24000      0    60000   35000
## X16950     5315.7504  52   16  114098.0861    10000  10000        0       0
## X9715128   8414.4992  39   17  123349.2823     2500   2000        0       0
## X11980     8233.0605  20   11   20558.2137      200      0        0       0
## X2345      6804.0267  40   17   53451.3557     4500    520    70000       0
## X21130129 11097.5342  78   12   15418.6603      310      0        0       0
## X12600     6482.1927  53   12   35976.8740      600   2500        0       0
## X3470130   6106.0776  50   12   59618.8198     3000  13900        0       0
## X9425      5837.2792  89   14   13362.8389      800  10000        0       0
## X21625     6119.4284  27   12   32893.1419     1000      0        0       0
## X13110     6654.0201  68   16   15418.6603      160  16000        0       0
## X10765     8414.4992  37   12   48311.8022     1000    770        0    2000
## X10290     6556.6396  37   14   95595.6938     1500   5600    80000       0
## X20650    11386.7530  48   16   31865.2313     5000    200    40000       0
## X20680     5009.8727  42   12   53451.3557     2000      0    84000       0
## X20325     7403.3590  38   13   70925.8373     2500  38800        0       0
## X15740     5752.8640  54    6   15418.6603        0      0        0       0
## X10040131  7735.4996  88   13   11307.0175        0      0        0       0
## X2085      7857.7937  38   10   10176.3158      780      0        0       0
## X18375132  3907.1193  26   12   29809.4099     1000   2500        0       0
## X15970     3029.4357  58   14   50367.6236     3000      0   180000   29000
## X15490     6835.9388  43   16   56535.0877     1200  24150     3500       0
## X9805      7615.4140  47   14   14390.7496      200      0        0       0
## X19805     5974.4062  32   13   62702.5518       70      0        0       0
## X6710133   6046.9947  44   14   77093.3014     2000   2000        0       0
## X20265134  4733.4575  40   16   28781.4992      820    400        0   18000
## X16850     4798.3702  27    7   32893.1419       20      0        0       0
## X10875135  7706.5913  41   17   63730.4625     2000  11500        0       0
## X10560136  7091.0393  78    6    8223.2855      660      0        0       0
## X19625137  8823.5948  63    1   17474.4817     4000  14700        0       0
## X13545     6930.3176  64   10   37004.7847      300   3000        0       0
## X13725138  7519.7320  55   11   27753.5885        0    700        0    4000
## X13385     6683.4423  40   12   81204.9442     2000      0     1000       0
## X16935139  6093.3713  22   12   24669.8565     8000      0        0       0
## X4930     11139.3320  79    7   21586.1244     8000  15000        0       0
## X20930     7581.9742  44   12   41116.4274      530    800        0     500
## X17100     5971.9347  57   14   28781.4992     8600  13000        0    5000
## X18795     8287.0909  78   11   19530.3030     3000  14000        0       0
## X1315      4712.3192  42   12   29809.4099        0   1500        0       0
## X3990140   6367.9364  65    5   12334.9282        0      0        0       0
## X6590141   8483.7784  51   12   88400.3190      480   3050        0       0
## X10940142  9740.7156  76   14   46255.9809     3200   1700   103000   20000
## X17560143  9704.9669  66   12   66814.1946     2500   5000        0       0
## X300       3764.5960  41   12   31865.2313       60    200        0       0
## X11475     7178.3649  23   16   74009.5694      810      0     3000       0
## X15370     5071.1464  52   12   44200.1595      600    340        0       0
## X12230     5301.4026  31   10   15418.6603      130      0        0       0
## X6570      7914.2643  58   13   21586.1244      200    200        0       0
## X13610     7169.1759  36   11   51395.5343       80     10        0       0
## X1940      7388.8300  37   17   59618.8198     4500   6000        0       0
##                FIN  VEHIC  HOMEEQ OTHNFIN     DEBT NETWORTH
## X17470       39600   6400   84000       0  40200.0   170800
## X315          5400  21000    8000       0  58640.0    17760
## X8795        15460   2000   12000       0  19610.0     9850
## X10720       54700  18250   90000       0   8000.0   284950
## X19170       12800   9100   47000       0  21000.0   268900
## X22075       70500   7500  175000       0      0.0   253000
## X12235       16000  16000       0       0  31000.0     1000
## X7670        12200  34000   22000       0  60600.0    45600
## X16555       13000   1800   15000       0      0.0    29800
## X370            50   1300       0       0   9800.0     -450
## X7680        12700   4200    8000       0  92000.0    24900
## X6880            0   3300   15000       0   3400.0    14900
## X16570       64100  31000       0       0  36200.0    58900
## X12945        4000   9400       0       0   1500.0    11900
## X6725         9050   8800   75000       0      0.0    92850
## X15725     1238000  69000 1600000       0      0.0  4032000
## X19880        4015  38000    7000       0 147400.0    -7385
## X225        813000  15000  130000       0      0.0   975000
## X4995       393440  14400  315000       0      0.0   722840
## X7700        48750  15400   20000       0 230000.0   194150
## X11375       20300  37800   52000       0  18810.0    91290
## X17920      111500  11000   88000       0  32340.0   200160
## X12365      120520  26900   87500       0  17300.0   230120
## X920         93000   7700   59000       0  66000.0   159700
## X19050      313500   7300  500000       0      0.0  1079800
## X19555       18650  30300   64000       0 125900.0   383050
## X10520        2300      0       0       0    600.0     1700
## X18705       16550  15200  111000       0 142000.0   129750
## X5095        60100   9600  333000       0      0.0   402700
## X11010         250   5800       0       0    840.0     5210
## X3540          760   4100       0       0     30.0     4830
## X14950      162700   3700   67000     800  43300.0   223900
## X4830         1350   4800    6000       0      0.0    12150
## X2865       167500   9300   75000       0      0.0   251800
## X20945      122710  29000  110000       0 159500.0   242210
## X13040       19880  13500       0       0   4400.0    28980
## X4515       135700   7900       0       0    780.0   142820
## X145            40      0       0       0    400.0     -360
## X18685        2480   8600   40000       0   5100.0    45980
## X17585       92100  22000   43000       0  92000.0   148900
## X10090           0      0       0       0   1300.0    -1300
## X13235       25000      0       0       0      0.0   525000
## X3045       287000  14700   34000       0 137100.0   324600
## X21425          90      0       0       0   5450.0    -5360
## X11840           0      0       0       0   6000.0    -6000
## X3400        25100   9900   44000       0  16600.0    73400
## X6635        25600   4200    5000       0   2100.0    32700
## X19815       80900   4700       0       0    300.0    85300
## X19565        7350  11800    9000       0  84770.0    26380
## X12135       67650  38900  108000       0 181400.0   168850
## X10700      640000  22000  100000       0      0.0   836000
## X2600        54700   4600   32000       0  98000.0    91300
## X2860         4830   2800       0       0    650.0     6980
## X2175         3600      0       0       0    900.0     2700
## X14915       72800  28800   15000       0  78100.0   108500
## X66351       25600   4200    5000       0   2100.0    32700
## X6575        21000  38000   19000       0  16000.0   709000
## X8410        12800  30000   18000       0  73000.0    39800
## X7230       360350  11000   60000       0      0.0   493850
## X12955        1000  16020   31000       0  20900.0    77120
## X19205       57550  13000  136000       0  60200.0   200350
## X600          9190  19700   27000    8000 139900.0    46990
## X1290         4750   6300       0       0  13400.0    -2350
## X17070     1189000  47000  425000   20000 340000.0  2031000
## X16140      226800  12000  100000       0      0.0   338800
## X17935       36000  31100  102000       0  93500.0   301100
## X3605        92000  17700       0       0   8800.0   100900
## X10275      274000  26500  106000    5000 180600.0   404900
## X19930       32450  23000  178000       0  40000.0   225450
## X15360      137000  40000  185000       0 162900.0   444100
## X1075       158090  14300   37000    2500      0.0   223890
## X7770       172600   6100   65000       0      0.0   243700
## X1010         2700      0   21000       0 108000.0    23700
## X7095          660      0       0       0      0.0      660
## X14255        2960   1900   33000       0    940.0    42920
## X20075       30000   7100  170000       0  11030.0   196070
## X2610          570   2700       0       0   1200.0     2070
## X965         49600   3400       0       0     20.0    52980
## X17515       37000   5000  115000       0  21200.0   155800
## X1755            0      0       0       0      0.0        0
## X16440      329500  81800   91000       0  59000.0   509300
## X14750        9200   2300   12000       0  62000.0    19500
## X16960       82800  31400  100000       0 115000.0   199200
## X575          2000   3800       0       0      0.0     5800
## X12340      317500   5600  217000       0      0.0   540100
## X3250          300   3800       0       0      0.0     4100
## X21805     1325001  15000  570000   25000 188000.0  2087001
## X17860       42400  30000   90000       0  72550.0   164850
## X6260       379000  31000  389000       0 254400.0   794600
## X8435         7160  14000       0       0  13480.0     7680
## X10795      151530  22800  125000       0      0.0   299330
## X9785         6650  11800   15000       0  55990.0    32460
## X17455        6000   2100   23000       0  52500.0    30600
## X11275      351300  12500  200000       0      0.0   563800
## X6785        61005      0   38000       0  93040.0    90965
## X12920           0      0       0       0      0.0        0
## X12685       49650   7400   62000       0  18000.0   114050
## X7575        31700   6800       0       0    300.0    38200
## X16745        3300   5000    2000       0  79800.0     3500
## X3925            0  13200       0       0  20000.0    -6800
## X13715        1140      0       0       0      0.0     1140
## X2630        50500  38000    7000       0 220500.0   224000
## X1880         1220   3100       0       0   1000.0     8320
## X16810        1520   7500       0       0  14100.0    -5080
## X7535        31251  16100       0       0  24700.0    24951
## X17395           0      0       0       0      0.0        0
## X20265       19220   4300       0       0      0.0    23520
## X16645      105330   9900  276000       0 254000.0   386230
## X18180      565400  24600  209000       0    840.0   798160
## X4825         1000  11300       0       0   3020.0     9280
## X1845       168000  21900  250000       0  18600.0   539300
## X5425        22800  21800   80000       0 220300.0   124300
## X10600      214000  54800   89000   30000  12800.0   386000
## X10360           0      0   27000       0  10000.0    27000
## X19890       19300   8800       0       0   5600.0    22500
## X20500           0      0  163000       0      0.0   163000
## X2565       430000  26600   70000       0   5000.0   526600
## X26002       54700   4600   32000       0  98000.0    91300
## X19845        3640  17000       0       0  13000.0     7640
## X18965       88600  42600   67000       0  28000.0   198200
## X11230           0      0       0       0      0.0        0
## X11260      173900  11600  420000       0    300.0   605200
## X3200        45740  41200  201200       0  60500.0   236440
## X5965        24020   7300       0       0    350.0    30970
## X107953     151530  22800  125000       0      0.0   299330
## X11035       51400   2500   60000       0      0.0   113900
## X18245       35800 122300   98800       0 104800.0   491300
## X11955       20000  27900    3000       0  47600.0    80300
## X9345           80      0       0       0      0.0       80
## X2320        25970  18100  118000       0  36680.0   147390
## X9295        46400   6000  290100       0  13100.0   339300
## X20110       16900      0    8000       0 167750.0    24150
## X680        327800  43000  220000       0      0.0   590800
## X13270       81350   9800       0       0   6500.0    84650
## X3075        78050   7800   29000       0 110840.0    71010
## X13160       85100   8800   37000       0 103000.0   120900
## X20435      780000      0  850000       0 850000.0  2780000
## X12465      783000  12100  170000       0  30000.0  1015100
## X4440          760      0       0       0      0.0      760
## X3870        78700   9400   20000       0   9800.0    98300
## X3510         7900  47500   29000       0  86300.0    73100
## X13795           0      0       0       0      0.0        0
## X18155        9000  18200       0       0  68300.0    76900
## X4685       115700  28700  141000       0  73900.0   270500
## X20135         770  17100       0       0   3610.0    14260
## X7975        40000   3500  210000       0      0.0   253500
## X16425      116300  23600   37000       0 201120.0   178780
## X84354        7160  14000       0       0  13480.0     7680
## X12905      196000   9000   58000       0 164000.0   298000
## X15095      133010  34200       0       0      0.0   167210
## X3625            0      0   10000       0      0.0    10000
## X198455       3640  17000       0       0  13000.0     7640
## X570         57001   8400   20000       0  11200.0    74201
## X21195      279390  21700  125000       0      0.0   506340
## X16470      400060  26000  380000       0      0.0  2856060
## X14880        6340   9900  106000       0  49000.0    98240
## X9485        11405  21700   47000       0  37950.0    55155
## X17090         700   4500       0       0      0.0     5200
## X9670        55500  31000  170000       0 139600.0   596900
## X15945           0  10100       0       0     70.0    10030
## X13535         400  22800       0       0  18800.0     4400
## X3685        17370   6400    4800       0  11360.0    25410
## X540          4930   5300   46000       0 258800.0    88930
## X17780      253900  13000  112000       0  28000.0   378900
## X21100      903000   9300  109000       0  34000.0  1093300
## X4310         2800   7700   25000       0 147100.0     3400
## X2010         1200   3000   20000       0 100000.0    24200
## X8785       406000  28800   80000       0      0.0   787800
## X1045         8500   6700   62000       0  23720.0    76480
## X2935         1500      0   70000       0      0.0    71500
## X11195      135000   2200       0       0   2300.0   134900
## X110356      51400   2500   60000       0      0.0   113900
## X3410        88711  25600   62000       0  38700.0   175611
## X17765       12000   4500       0       0      0.0    16500
## X9175        22200  31900   25000       0 107700.0    46400
## X6395        19000  12000       0       0    520.0    30480
## X485          5400  12600       0       0   4140.0    13860
## X870         52440  50700  149000       0 174000.0   208140
## X9220           40      0       0       0    300.0     -260
## X1920          400   6300   28000       0   1600.0    33100
## X19230      169350  18200  300000       0      0.0   487550
## X18475      875950   9100  240000       0      0.0  1125050
## X5895         1120   1700       0       0      0.0     2820
## X3695        40100   5900    8000       0 169000.0    54000
## X17075      126500  13000   10000       0  96700.0    92800
## X21685         800  13900   23000       0  39200.0    24500
## X10410       18300  15000       0       0      0.0    86300
## X1350            0      0       0       0      0.0        0
## X18760       61400  17000  124000       0 303000.0   487900
## X3405          825  13000       0       0  28800.0   -14975
## X12035        1170   8900   28000       0 108800.0    26270
## X305        223800  16200   60000       0      0.0   300000
## X17850       83500  13000  201000       0 129700.0   416800
## X4110      3577000 188000 1400000       0 400000.0  5465000
## X4605        16800  10900  100000       0      0.0   127700
## X12555           0      0       0       0      0.0        0
## X5915       478500      0  410000       0 290000.0  2278500
## X22035           0      0       0       0      0.0        0
## X6930        64000   8000   50000       0 128200.0   303800
## X17060       13800  17000  143000       0      0.0   173800
## X13760       71750  50600  100000       0  10000.0   517350
## X5825         1300  11600       0       0      0.0    12900
## X34057         825  13000       0       0  28800.0   -14975
## X20180        3350  49700   10000       0 156300.0    39750
## X21130         310   4400   83000       0      0.0    87710
## X12205           0      0       0       0  30000.0   -30000
## X1265         7300   9100   85000       0   9100.0    94300
## X13645      108000  17500  200000       0      0.0   325500
## X905           700      0       0       0      0.0      700
## X21995       36300      0       0       0     60.0    36240
## X6975         1100   7600       0       0    870.0     7830
## X16450       20920  12000       0       0  20700.0    12220
## X14840        7000      0       0       0      0.0     7000
## X8300       348950  27200   25000       0 126500.0   370650
## X645        132000  27000  195000       0 155000.0  1074000
## X2770        11400   6300  115600       0  41400.0    96300
## X147508       9200   2300   12000       0  62000.0    19500
## X1540         3560  13000    2000       0  91140.0     3420
## X19435           0      0       0       0      0.0        0
## X6765        10400  38800   32000       0  76200.0    73000
## X54259       22800  21800   80000       0 220300.0   124300
## X19980       45000  13000  100000       0      0.0   158000
## X54010        4930   5300   46000       0 258800.0    88930
## X21890      200650  38400   60000       0 176000.0   263050
## X1220        23200   2900   27000       0      0.0    53100
## X16615     1107000  43700  179000       0 103400.0  1622300
## X16905      680430   9900  200000       0      0.0  1020330
## X9050       780200  33000  360000       0 135000.0  1577800
## X21165      426300  79200   60000   70000 143000.0   632500
## X16350           0      0       0       0      0.0        0
## X14085         200  11000       0       0  12500.0    -1300
## X11465        1600  12800       0       0    800.0    13600
## X12610       30000  23100 4000000  120000      0.0  4173100
## X785         18900  24700       0       0  38000.0    20600
## X14485        1860   9300   31000       0  42200.0    41960
## X8580       101400   6700   24000       0  66300.0   128800
## X10340         500  33000   27000       0  93000.0    35500
## X20855         650      0       0       0      0.0      650
## X5420        12310  13700   -7000       0 168930.0   -32920
## X1200        68700  18600   77000       0   2400.0   161900
## X13395       39500  39600   60000       0 274000.0   145100
## X10230       10000  10000  159000       0  41000.0   179000
## X17945        8960   5200   11000    5100 192650.0  -101390
## X565           300   6700       0       0   1200.0     5800
## X18070       39300   1500  160000       0      0.0   200800
## X509511      60100   9600  333000       0      0.0   402700
## X8940        29000  17500       0       0   9800.0    36700
## X11575      154000   4000  211000       0   9950.0   368050
## X1213512     67650  38900  108000       0 181400.0   168850
## X14770       81950  28200   62000       0 139000.0   151150
## X22015       44800  20800  125000       0  13200.0   277400
## X4965            0   3600       0       0  15000.0   -11400
## X1660       148000  15000  133000       0  23800.0   272200
## X20795        4510   5500       0       0   4200.0     8910
## X2045         2000  27000    6000       0 195800.0    -1800
## X10235      395201      0  400000       0   5000.0   790201
## X12060         220   4400       0       0    900.0     3720
## X5680       203700  31100   27000   50000 253610.0   294190
## X20215       43660   9500       0       0  12000.0    41160
## X15375        3200   6400   20000       0      0.0    29600
## X10740           0      0  102000       0  28600.0   101400
## X4160       389500      0  300000       0  30000.0   714500
## X310          2500   2000       0       0    600.0     3900
## X3235         6200  20100   70000       0  24670.0    91630
## X21055      104220   3900  135000       0   7000.0   254120
## X2620            0      0       0       0      0.0        0
## X1600        35820      0       0    5000   5860.0    34960
## X1751513     37000   5000  115000       0  21200.0   155800
## X5765            0   2000       0       0    650.0     1350
## X16945      413300  20800   90000       0    180.0   553920
## X20830      320000   7600   60000       0 580000.0  7547600
## X10105      412500  21000   12000       0 187600.0   420900
## X4895        63100   9980   18000       0  62000.0    91080
## X9895            0      0       0       0      0.0        0
## X10650         100   2600   40000       0      0.0    42700
## X8705        88000  32100  125000       0  25800.0   219300
## X1490        68000  21600   29088       0      0.0   629600
## X341014      88711  25600   62000       0  38700.0   175611
## X1408515       200  11000       0       0  12500.0    -1300
## X16235        8370  52000    9000       0  89400.0    32970
## X2201516     44800  20800  125000       0  13200.0   277400
## X17115        9100  27700   49000       0  16500.0   144300
## X22110       41450   9800   20000       0 111800.0    62450
## X5075         9550   6600   87000       0      0.0   103150
## X3895       515650  20580  346000       0  54080.0   882150
## X18550        4850  15010   77000       0  56910.0    82950
## X1998017     45000  13000  100000       0      0.0   158000
## X10815        8960   9700   90000       0      0.0   148660
## X130         67800  10000  185000       0  40080.0   497720
## X15700       43950  62000   41000       0 190500.0   103450
## X10560         660   3700       0       0      0.0     4360
## X8180         9500   2600   20000       0      0.0    32100
## X6115        58000   7300       0       0   9850.0    55450
## X11495      182500  17100   47000       0  97100.0   222500
## X17710      187500  24000  125000    6000      0.0   342500
## X10510           0      0       0       0      0.0        0
## X10990       12500  11200   18000       0  31100.0    38600
## X13300        5220   5200   30000       0  93340.0    37080
## X19315        5605  11100   15000       0  72500.0    24205
## X10685           0   3400       0       0   2500.0      900
## X19330      188600   7300   60000   13000 101200.0   267700
## X16260          10   2100       0       0      0.0     2110
## X13945       13600   7900       0       0  98000.0    18500
## X2330       109400  14000   95000       0 100110.0   288290
## X12080         760   6100       0       0  11770.0    -4910
## X16900     2940400  60900  400000       0 124000.0  3697300
## X1080       108810  33000   52000       0  41300.0   320510
## X19180      867350  41500  301000       0 137000.0  1196850
## X2925       561000  13000  182000    9000 168000.0   765000
## X7555       131900   4900   80000       0      0.0   216800
## X16600       85700  37900  122000       0  36300.0   317300
## X16795        8450  38000    5000       0 152250.0    36200
## X16545      132500  17900  110000       0   8050.0   252350
## X20245        2000  22700    3000       0  36490.0    15210
## X9180       137000  36400   33000       0 102500.0   185900
## X16480      448100  38100  239000   50000  16000.0   825200
## X17355       67010  30000  490000       0 120000.0   527010
## X5875         7000   1300   40000       0      0.0    48300
## X16145       26200  11000  137000       0  40500.0   171700
## X21770           0      0       0       0      0.0        0
## X11820      298601  18000   44000       0  91000.0   360601
## X9390        50000   2500   59000       0      0.0   111500
## X12520        6000      0       0       0      0.0     6000
## X12040      128500  18000  300000       0      0.0   637500
## X8890        27700  37300   17000       0  98300.0    83900
## X150          7100  14000   10000       0  45330.0    29770
## X4600        21300  64000   72000       0  28000.0   183300
## X12490        1540   6300   16000       0  42800.0    20040
## X5640        11200   2900   31300       0   5700.0    45400
## X1758518     92100  22000   43000       0  92000.0   148900
## X3105      7885000  92000  900000       0      0.0  9102000
## X1070019    640000  22000  100000       0      0.0   836000
## X7035        26000   2500   63000       0 114000.0   124500
## X19950      949410  38100   30000       0 148750.0  1183760
## X12835        2515   5700       0       0   9540.0    22675
## X12050        5300   9100   -2000       0  84400.0    12000
## X12605       80210  48900   25000       0 119700.0    99410
## X16605       48500  21000  225000       0      0.0   294500
## X100             0      0       0       0      0.0        0
## X21095      386100  24300   70000       0 142560.0   454840
## X9640        14500   1200   21000       0  69260.0    36440
## X18820     2251700 121000  301000 2500000 157050.0  5165650
## X20565      384360  30900  148000       0 187000.0   478260
## X13035      449800  19600  175000       0      0.0   644400
## X15555     2187200 179000  170000 2500000      0.0 13036200
## X8315       286000   8000  300000       0      0.0   664000
## X990         13500  15700    6000       0  91100.0    33100
## X19305       20400      0       0       0      0.0    70400
## X14165      398500  15500       0       0      0.0   564000
## X1785        17300   5400   95000       0  26000.0   118700
## X22045       42000  12000  101000       0  42000.0   137000
## X31520        5400  21000    8000       0  58640.0    17760
## X21535       71150   4000   39000       0  11000.0   114150
## X8005        60000   2000       0       0      0.0   132000
## X21855        3230   4000       0       0   3800.0    78430
## X2965        93500  11600   33000       0  67930.0   137170
## X19925      150700  18600   85000       0   7400.0   246900
## X21305           0      0       0       0      0.0        0
## X11315     1450900  18400  173800   20000   1200.0  1663100
## X2870          200  16300   15300       0   3700.0    31800
## X7845          200  21000   -9000       0  72200.0   -16000
## X11325       30700  29000   39000       0 113400.0    55300
## X7135        28300      0       0       0  21900.0     6400
## X1223521     16000  16000       0       0  31000.0     1000
## X19955       87600   6800  235000       0      0.0   329400
## X12115       27500      0       0       0   1900.0    25600
## X20770        2000   5600       0       0      0.0     7600
## X695          1200   4300       0       0  15570.0   -10070
## X11320       85000  33400   65000       0  90000.0   183400
## X7080       464300  22000  138000       0      0.0   624300
## X21705        2000  17500    6000       0      0.0    25500
## X10855       51700  55500  288000       0 201410.0   355790
## X6340        20100  28600   64000       0 169000.0    99700
## X17450           0   2600       0       0      0.0     2600
## X2895         1001      0    1000       0    550.0     1451
## X8115       716000   5700  200000       0      0.0   921700
## X430         93810      0       0       0  30200.0    63610
## X1027522    274000  26500  106000    5000 180600.0   404900
## X387023      78700   9400   20000       0   9800.0    98300
## X13920         700  12000       0       0  12190.0      510
## X3160          300   2300       0       0    100.0     2500
## X19125      527700  44000  145000       0  30200.0  1066500
## X21480     1953150  44500  700000  220000      0.0  3427650
## X13180         750   4200       0       0   2180.0     2770
## X7970       111000   2500  125000       0      0.0   238500
## X11435       19100  25500   38000       0 118100.0    81500
## X16800         200      0       0       0    160.0       40
## X5575          100   7300  550000       0      0.0   557400
## X9880          700      0       0       0      0.0      700
## X13440          10   1200       0       0    330.0      880
## X17370         600  17400   60000       0 141200.0    66800
## X17200        6000  13200  250000       0      0.0   269200
## X3905       365800  48000  428000       0 123800.0   850000
## X14000           0      0       0       0      0.0        0
## X9710         4600      0       0       0   5000.0     -400
## X5300       282700  13000  100000       0      0.0   395700
## X12985        1200  13820   77000       0  25700.0   691320
## X2007524     30000   7100  170000       0  11030.0   196070
## X15575        4000   4500  112000       0  98200.0   120300
## X4245          400 101900   91000       0  11800.0   190500
## X21505           0  10100       0       0   1200.0     8900
## X4215          530      0       0       0  10000.0    -9470
## X12535         700   3800  165000       0   1200.0   168300
## X16475        6800   6300   60000       0      0.0    73100
## X4570           50  17400   32000       0  40100.0    37350
## X15300           0      0   10000       0    600.0     9400
## X18200       76210  27400   28000       0  68080.0   131530
## X2325       214900  23700  215000       0 151600.0   527000
## X3430       435900  14100   62000       0   3000.0   512000
## X7495          300   4100   31000       0      0.0    35400
## X489525      63100   9980   18000       0  62000.0    91080
## X1680026       200      0       0       0    160.0       40
## X21375      176601  35500       0       0  33200.0   178901
## X11115       14800  12500    1000       0  64000.0    19300
## X5220        22000   6900       0       0   1000.0    27900
## X1488027      6340   9900  106000       0  49000.0    98240
## X21940       18230  15900  174000       0  13700.0   194430
## X1364528    108000  17500  200000       0      0.0   325500
## X21040       11130  16000       0       0  29000.0    -1870
## X7125        39000  36000   70000       0  27710.0   117290
## X8670          510  11200       0       0  10100.0     1610
## X10640      231500  40600   30000       0 118250.0   260850
## X18375        3600  11000       0       0  11220.0     3380
## X20845        6505  20100    4000       0 124900.0   -18295
## X595        685000  17500  137000       0  38000.0   839500
## X1455       227000  15700  184000       0  74200.0   491000
## X8760            0      0       0       0      0.0        0
## X626029     379000  31000  389000       0 254400.0   794600
## X8475       631500  30700  500000       0      0.0  1162200
## X3085        30600  23700   12000       0  56300.0    58000
## X9285      1805000  14000  750000       0      0.0  2569000
## X6940         3580   9800   25000       0  84800.0    29580
## X557530        100   7300  550000       0      0.0   557400
## X16710        4500  20200   44000       0    850.0    67850
## X15515          20      0       0       0      0.0       20
## X3530       108200  12000  130000       0 120000.0   250200
## X6860         2300      0       0       0      0.0     2300
## X14630        6180   9900       0       0   4500.0    11580
## X14705       72760  11000   90000       0      0.0   173760
## X13010       13800      0       0       0    250.0    13550
## X1792031    111500  11000   88000       0  32340.0   200160
## X12705       60030  32500  400000       0   8200.0   484330
## X9870         3000  36800       0       0  21000.0    18800
## X17305        8950  19200   10000       0  75000.0    33150
## X21595        5000      0    2000       0  75400.0     -400
## X13725       24220  10000   70000       0  21000.0   103220
## X10040           0      0   88000       0      0.0    88000
## X7005        29000  16000   58000       0  32900.0   172100
## X3760        42720  24000       0       0   8100.0    58620
## X14910      473000  33000  147000       0 222000.0   624000
## X13365          10   5000   75000       0      0.0    80010
## X11410        2400   3800  261400       0  67600.0   267600
## X12220      992100  88000  250000   60000 250000.0  2390100
## X18420       24450  23100   17000   21000  93000.0    70550
## X9005          100   3200       0       0  10000.0    -6700
## X11855        7200      0   -3000       0 143500.0   -16300
## X21405           0   1800       0       0      0.0     1800
## X21260       14600   9100   30000       0    600.0    53100
## X8020          500      0       0       0      0.0      500
## X10370        4650   5800   10000       0  94100.0    16350
## X15255        3000   8200   50000       0   2000.0    59200
## X12735       45500  18000  153000       0  75300.0   198200
## X2635       133200  28400   32000       0  55000.0   208100
## X4765        11740   9000       0       0      0.0    25740
## X20295       62200  13000   16000       0 171600.0   533600
## X4030            0   3900       0       0   1100.0     2800
## X21360        3730  22300   59000       0 220300.0    75730
## X858032     101400   6700   24000       0  66300.0   128800
## X15240      159600  47000   73900       0 297100.0   389500
## X1052033      2300      0       0       0    600.0     1700
## X8230      5358500  49500  400000       0  10000.0  5843315
## X6565        87890  19400   71000       0  79000.0   178290
## X8210         4000  14500   75000       0    300.0    93200
## X16370       16300      0       0       0  52200.0    34100
## X2495        19500   6000       0       0    150.0    25350
## X4950        79300   9400   89000       0      0.0   206700
## X20625       60000      0       0       0 350000.0  3360000
## X12640       23300  18000   88000       0 173200.0   134100
## X16455      249000  26500  402000       0 186000.0   979500
## X20670          50   3500       0       0   6990.0     -790
## X9855            0      0       0       0      0.0        0
## X7590         3200   6100   35000       0  45100.0    39200
## X10390         160      0       0       0      0.0      160
## X6885       410000   2000  300000       0  18000.0   694000
## X12630      360000      0       0       0      0.0   360000
## X587534       7000   1300   40000       0      0.0    48300
## X6415       115050  32000  135000       0  11190.0   270860
## X13800      241500  20500  150000       0      0.0   412000
## X21210        5500  17000    9000       0  98700.0     8800
## X20775         700   5100       0       0   5000.0      800
## X16165        7200  59800   30000       0  43400.0   253600
## X249535      19500   6000       0       0    150.0    25350
## X18530       57500  15100  135000       0 150410.0   197190
## X1182036    298601  18000   44000       0  91000.0   360601
## X2485       133400  33800   26000       0 127000.0  1508200
## X16785        9900   1700   30000       0      0.0    92600
## X11750      155000  15100  210000   30000      0.0   440100
## X3025         7100  18000       0       0  29000.0    -3900
## X3470        19800  18900   75000       0  13000.0   100700
## X1860       236700  11300   12000       0 325500.0   422500
## X3920          800   5500       0       0  10000.0    -3700
## X19430        5000   3000   25000       0      0.0    33000
## X16535         400   3000       0       0   1240.0     2160
## X13620       22500   5400       0       0     50.0    27850
## X17880      171700  18300   11000       0  85700.0   174300
## X4875       220300  29200   60000       0 335000.0   419500
## X19300        4300   2700   16500       0  10400.0    21600
## X7075         7500      0       0       0  32500.0   -25000
## X15130     2330000  67400  165000  200000 110000.0  3342400
## X13555       77000  47000    8000       0 172000.0   212000
## X8385          400  20700       0       0  11380.0     9720
## X831537     286000   8000  300000       0      0.0   664000
## X1330         3940  20600       0       0  32330.0     1710
## X6710       125000  65200   60000       0 115700.0   194500
## X6055        20500  47000  600000       0   1800.0  1065700
## X20455     1384000  82000  350000   70000  70000.0  2643000
## X2025       558600  43800  441000       0 259000.0  1307480
## X8485        50800   4700   18000       0  52050.0    73450
## X6475         1000      0       0       0  10300.0    -9300
## X4305      1378000  23000  210000       0      0.0  1611000
## X6900       540000  48800  130000       0      0.0  1018800
## X14525      607900  36000  408000       0 468410.0  1253490
## X3070        55020   6200   36000       0  84500.0    96720
## X811538     716000   5700  200000       0      0.0   921700
## X1420       104500      0       0       0      0.0   104500
## X542039      12310  13700   -7000       0 168930.0   -32920
## X18380      748800  20200   30000       0      0.0  1184500
## X4185        51230   6100   90000       0      0.0   147330
## X13830       26020   3200       0       0  14000.0    15220
## X6590       134930  26900   55000       0 141070.0   195760
## X13340       10500   4200       0       0   6080.0     8620
## X5625       197000  32600  185000       0 324570.0   480030
## X9625       285100  11900  258000       0  30000.0   597000
## X12020       24000   2600  180000       0      0.0   206600
## X9580       527330  12300  212000       0 638000.0  1751630
## X277040      11400   6300  115600       0  41400.0    96300
## X50          71000  12000       0       0  25110.0    57890
## X369541      40100   5900    8000       0 169000.0    54000
## X7540        39300   3200       0   15000      0.0    57500
## X1030         2000      0       0       0      0.0     2000
## X14400           0      0   50000       0      0.0    50000
## X7415        61000   6500   96000       0 105300.0   169200
## X3990            0      0       0       0      0.0        0
## X3245        79300  16000   70000       0      0.0   165300
## X2575        23200   5600       0   30000 142120.0   751680
## X9105         1500   5700   15000       0  17400.0    34800
## X7985         3240  48400   63000   22000 118700.0   158440
## X1300        24910  15000  215000       0      0.0   254910
## X4760       144300  33300  259000   20000 181800.0   460800
## X16305          60   2600       0       0   5300.0    -2640
## X21035      216000   4800   70000       0  80800.0   290000
## X2905        43000      0       0       0      0.0    43000
## X1610       100500  14000   50000       0   2000.0   162500
## X3490        31500  17000   37000       0  93000.0   335500
## X16585          50      0       0       0   2300.0    -2250
## X4145        30000   4900   30000       0      0.0    64900
## X3135            0  14000   94000       0  56000.0   108000
## X6000        37500   4300       0       0      0.0    41800
## X10420       10000  15000       0       0      0.0    25000
## X1655            0  10200   32000       0  46900.0    43300
## X10705        4000      0       0       0      0.0     4000
## X11735      388900  48800  230000       0  70000.0   667700
## X6720         1700   4700       0       0      0.0     6400
## X12680        6150  11700   84000       0  36000.0    71850
## X7530         7200  29200   30000       0  12000.0    54400
## X7795         1421   9100       0       0   7800.0     2721
## X1480        23850  14000       0       0      0.0    37850
## X21575      227540  29000  100000       0 153000.0   333540
## X2585        47000  28100   30000       0  80350.0   261050
## X16595           0      0       0       0      0.0        0
## X4040          100   3600   35000       0    150.0    38550
## X1630542        60   2600       0       0   5300.0    -2640
## X16330       12840   4800       0       0    300.0    17340
## X15665       72640  24000   60000       0  26900.0   269740
## X690043     540000  48800  130000       0      0.0  1018800
## X6795          400  17200   55000       0 109900.0   222700
## X21350        3400   3200  113000    1500  17220.0   119880
## X6700       658000  21000  155000       0  95000.0  2343000
## X18665      103000  20300   57000       0  63000.0   180300
## X19580      192800  44000   37000    1200  72900.0   240100
## X20130       16200  26600    6000       0 101100.0    36700
## X10325        5720   6600       0       0   6500.0     5820
## X4130       798780  23000   72000       0 166000.0  1403780
## X5475        36990  37900  175000       0   8220.0   316170
## X1790         1500  22900  150000       0      0.0   196900
## X17480       31650  11600  172000       0 102240.0   210010
## X12830       52600  21000       0       0  10100.0    63500
## X1865        25500  12800   44000       0  32300.0    81000
## X768044      12700   4200    8000       0  92000.0    24900
## X457045         50  17400   32000       0  40100.0    37350
## X11490       28720  25000  200000       0      0.0   253720
## X1146546      1600  12800       0       0    800.0    13600
## X10180        3100   2700       0       0  17500.0   -11700
## X3910       647000   7100       0       0      0.0   786600
## X11565       33750  35000   91000       0 147370.0   291380
## X21825        2100  28000       0       0  19600.0    10500
## X4525         1000   2300       0       0      0.0     3300
## X3060         1300   1800   48000       0      0.0    51100
## X9250       269930  11000   10000       0  80000.0   290930
## X17500       65000  34500  175000       0    450.0   299050
## X1100            0  17700       0       0   7000.0    10700
## X16025        2451  19200  200000       0 150000.0   221651
## X12380      163500  23000   79000       0 147700.0   263800
## X753547      31251  16100       0       0  24700.0    24951
## X12850        9100  18000       0       0  11240.0    15860
## X1955548     18650  30300   64000       0 125900.0   383050
## X19775        2750   3000       0       0      0.0    30750
## X11525         391      0       0       0    810.0     -419
## X2975          600   5000   55000       0    770.0    60830
## X18895        2600   6900  125000       0  23600.0   130900
## X1602549      2451  19200  200000       0 150000.0   221651
## X345          9620  14000    9000       0  50000.0    32620
## X490         18200  29000   41000       0 259000.0    88200
## X14580    22970000  89000  700000       0   2000.0 26208000
## X10875      102500  38000  100000       0  78150.0   222350
## X5270         7390   9800       0       0  31900.0   -14710
## X9400          230   3400   40000       0   2360.0    41270
## X12900       51400  15400   90000       0    380.0   168420
## X4530        43500   2400   75000       0      0.0   120900
## X17670           0      0  130000       0      0.0   130000
## X5440         4330  26700   28000       0  97080.0    33950
## X8875       138530  41600   41000       0 137250.0   232880
## X2060       147940   2800  180000       0      0.0   330740
## X2153550     71150   4000   39000       0  11000.0   114150
## X5080        11500  25100   78000       0  43740.0    81860
## X12500       61300  43000  133000       0 119000.0   235300
## X830           100      0       0       0      0.0      100
## X495051      79300   9400   89000       0      0.0   206700
## X1304052     19880  13500       0       0   4400.0    28980
## X16685         500   6600       0       0   4800.0     2300
## X5695       145900  14000   55000       0  10180.0   204720
## X2026553     19220   4300       0       0      0.0    23520
## X215        101050   4700   70000       0  55000.0   175750
## X7460        77300  24100    5000       0 154000.0    57400
## X21060      100500      0       0       0      0.0   100500
## X3770       160600  19500   87000       0      0.0   267100
## X940054        230   3400   40000       0   2360.0    41270
## X15320        7600   9900       0       0   7940.0     9560
## X96555       49600   3400       0       0     20.0    52980
## X19340       36000  12000  100000       0      0.0   148000
## X1395         1800  20600       0       0   7810.0    14590
## X939056      50000   2500   59000       0      0.0   111500
## X5245         1430   8400   32000       0   3080.0    41750
## X18830        2650   6000   24000       0  33710.0    30940
## X15215       72000  19000   58000       0 132400.0   133600
## X496557          0   3600       0       0  15000.0   -11400
## X12210       23500   5200       0    6000      0.0    34700
## X17560       17700  24000   69000       0 101930.0    99770
## X19625       38700  37200  124000       0  21000.0   199900
## X2530        50000   6100   50000       0    590.0   105510
## X9075        16000  20000   17000       0  80400.0    37600
## X1925        41450   2200   30000       0      0.0    73650
## X21010      140000  23100   92000       0      0.0   263100
## X1745058         0   2600       0       0      0.0     2600
## X17555        5700   4200   38000       0  81000.0    40900
## X2018059      3350  49700   10000       0 156300.0    39750
## X5330      1574000  61000 1200000       0      0.0  4275000
## X2150560         0  10100       0       0   1200.0     8900
## X2970        12600      0       0       0      0.0    12600
## X19190       51900  10000   76000    2000 179200.0   134700
## X12570        5450   5300   10000       0  28480.0    20270
## X1325         6600  16600   56000       0  90700.0   481500
## X4195        70000  26600   67000       0 112000.0   139600
## X20915        3000   5000  105000       0  52000.0   106000
## X14145      749300  52900   47000       0  84000.0   848200
## X13090         300   4500       0       0  10100.0    37200
## X2211061     41450   9800   20000       0 111800.0    62450
## X13062       67800  10000  185000       0  40080.0   497720
## X7190           10   1400       0       0    900.0      510
## X10690         200      0   41000       0  35000.0    41200
## X21495        2000  10700   26000       0  24000.0    38700
## X3745        11370  22900   38000       0  83650.0    60620
## X2315        58310   5100       0       0    500.0    62910
## X3170          160   3000       0       0      0.0     3160
## X10940      293900   9200  250000       0      0.0   553100
## X2116563    426300  79200   60000   70000 143000.0   632500
## X2109564    386100  24300   70000       0 142560.0   454840
## X233065     109400  14000   95000       0 100110.0   288290
## X17530       37000  13800   65000       0 100280.0   115520
## X12410     3084500  59000  876000       0      0.0  4569500
## X1694566    413300  20800   90000       0    180.0   553920
## X1250       205280  48000  115000       0  21550.0   346730
## X1323567     25000      0       0       0      0.0   525000
## X17190      100950  31000  271000       0 134675.0   397275
## X103068       2000      0       0       0      0.0     2000
## X1018069      3100   2700       0       0  17500.0   -11700
## X18295         300   3800       0       0   6000.0    -1900
## X8770        11300   7400  220000       0      0.0   238700
## X585         65300  17000   48000       0 172250.0   120050
## X8750         2000  28100   21000       0  90800.0  1094300
## X13955        7900  15500  222000       0  29400.0   254000
## X18825     8551000   4500 1500000       0  55300.0 35953200
## X12280       41500  46700   23000       0 118500.0    95700
## X21780      161000  10250   12000       0 113200.0   183050
## X17810      246000  41600  132000       0  83000.0   444600
## X2535       226000   3300   90000       0  50050.0   319250
## X1614070    226800  12000  100000       0      0.0   338800
## X17290       32510      0   75700       0   9300.0   108210
## X15400      605700  31000  240000       0   2400.0  2897300
## X1842071     24450  23100   17000   21000  93000.0    70550
## X1650         5000   2700   82000       0  54700.0   438000
## X2050072         0      0  163000       0      0.0   163000
## X360         18900      0   -4000       0  64200.0    13700
## X1895       185850  24700   98000       0 102000.0   308550
## X7350         1020  12300   10000       0   6720.0    16600
## X107573     158090  14300   37000    2500      0.0   223890
## X2204574     42000  12000  101000       0  42000.0   137000
## X10345          20   4000       0       0  64960.0   -60940
## X90575         700      0       0       0      0.0      700
## X75          60000  24900   50000       0 123000.0   112900
## X1550       176860   7300   77000       0  29210.0   251950
## X6040          500   2300    7000       0    750.0     9050
## X3525       820750  17700  320000       0    800.0  1219650
## X6980       169940  34900  220000       0  31660.0   418180
## X178576      17300   5400   95000       0  26000.0   118700
## X16935        9200      0       0    2500      0.0    11700
## X928577    1805000  14000  750000       0      0.0  2569000
## X9615            0   3000     600       0  81400.0   -25400
## X17940        2030  15500   32000       0   9750.0    39780
## X2855            0      0       0       0      0.0        0
## X17900       22590  14100   25000       0 119900.0   -58210
## X1222078    992100  88000  250000   60000 250000.0  2390100
## X2020        37550  34000   98600       0  29870.0   141680
## X10380           0   6700       0       0   9300.0    -2600
## X4000        29100  67400  135000       0      0.0   231500
## X90           5900  24100   20000       0  92800.0    26200
## X19145      124100   6800   62000       0  48000.0   192900
## X9140         3000   2500       0       0      0.0    14500
## X2300           35   5000       0       0   3200.0     1835
## X13560         730   3900       0       0   1000.0     3630
## X1767079         0      0  130000       0      0.0   130000
## X9665        67520  28000   21000       0  96300.0    99220
## X9065         5000   2200   40000       0   6200.0    41000
## X12715       12380  55800       0       0  23430.0    44750
## X1069080       200      0   41000       0  35000.0    41200
## X15150       48000  14000       0       0  16180.0    45820
## X14780     3447000  22000  123000       0 102000.0  3592000
## X3080       123750  24200   67000       0 146500.0   148450
## X1023081     10000  10000  159000       0  41000.0   179000
## X9725        38700  18900   77000       0  56300.0   116300
## X1330082      5220   5200   30000       0  93340.0    37080
## X3215       339000  15300  127000       0  98000.0   481300
## X1069083       200      0   41000       0  35000.0    41200
## X19635           0      0       0       0    601.0     -601
## X14800        3300  15200   26000       0  88000.0    30500
## X2105584    104220   3900  135000       0   7000.0   254120
## X21380           0      0       0       0   1190.0    -1190
## X2024585      2000  22700    3000       0  36490.0    15210
## X11040       10250  11400   38000       0  35600.0    58050
## X12070       23220  16000   69000       0   6400.0   101820
## X3465        70410  36000   64000       0 117000.0   149410
## X20725           0      0   52000       0      0.0    52000
## X15730       49100  33600   84000       0 274440.0   137260
## X17005      603500  22000  300000       0 615000.0  3060500
## X4065       124440   8700   54000       0   6000.0   198140
## X620           300      0   40000       0      0.0    40300
## X1776586     12000   4500       0       0      0.0    16500
## X2715         1900   7200   55000       0 111000.0    58100
## X5210        27300  14000   40000       0 136000.0    75300
## X1303587    449800  19600  175000       0      0.0   644400
## X18625      280210  26000   93000       0  72050.0   392160
## X20460         300   6400       0       0   2100.0     4600
## X4700        20000  12200   96000       0  11400.0   120800
## X256588     430000  26600   70000       0   5000.0   526600
## X4180      2499000  49000  773000       0 302000.0  3259200
## X5760       196450   4900  134000       0 231500.0   354850
## X286589     167500   9300   75000       0      0.0   251800
## X5745        31700   9900       0       0  54500.0    30100
## X5175          360   8300       0       0   2200.0     6460
## X15105       47010   9900   23000       0  30150.0    76760
## X19895         370  18000       0       0      0.0    68370
## X1210       111000  42000   71000       0  81000.0   212000
## X1998090     45000  13000  100000       0      0.0   158000
## X202091      37550  34000   98600       0  29870.0   141680
## X14570       12300  15800   23000   15000  82400.0    49700
## X2100        20480   5300       0       0      0.0    25780
## X1208092       760   6100       0       0  11770.0    -4910
## X21340       28640  22200       0       0    490.0    50350
## X14250         200      0       0       0  46400.0    -7200
## X1719093    100950  31000  271000       0 134675.0   397275
## X2105594    104220   3900  135000       0   7000.0   254120
## X11425      386330  34100  105000       0 125000.0   525430
## X2135095      3400   3200  113000    1500  17220.0   119880
## X13565       40050  17900   55000       0  35000.0   112950
## X17540      691800  31400  425000       0      0.0  1148200
## X2985        19000   4100       0       0  63000.0    23100
## X15070        7680      0       0       0  19260.0   -11580
## X3505        22800  18200  140000       0      0.0   249000
## X15015        1530  13000   30000       0  50930.0    43600
## X16815         500      0       0       0   9600.0    -9100
## X7485        27300   3200   20000       0  64000.0    50500
## X18460      154500  13000   50000       0  53450.0   264050
## X9465        10000      0       0       0  12000.0    60000
## X10825        2550   5500       0       0      0.0     8050
## X8105       197150  10000   66000    6000 189000.0   274150
## X5820       147000  31000   39000       0  92900.0   208100
## X14765      417500  46000   90000       0   1200.0   552300
## X5340      1168010  13000       0   25000      0.0  1286010
## X3720            0      0       0       0      0.0        0
## X4475        47350      0  150000       0      0.0   197350
## X15185        8420   5600   16000       0  55500.0    28520
## X68096      327800  43000  220000       0      0.0   590800
## X13745           0   2900   -6000       0  26000.0    -3100
## X125097     205280  48000  115000       0  21550.0   346730
## X8345         3600  17700       0       0   6830.0    14470
## X2435          285  10000       0       0   1590.0     8695
## X1900       650200  22000   80000       0      0.0   752200
## X11670       40200      0       0       0      0.0    40200
## X18465       63500  25500   91000       0  28000.0   161000
## X20605       38100   2500       0    3000    600.0    43000
## X1127598    351300  12500  200000       0      0.0   563800
## X15815        2100  26960   95000       0  75300.0   267460
## X4465        99600   8500  160000       0 100000.0   319100
## X14585      137500  17600   90000    2000  15700.0   241400
## X12930       24850  10300   10000       0 126000.0    39150
## X3875        57000  10000       0       0  27560.0    39440
## X11340      848300 103100  100000       0   1400.0  1990000
## X4985       116880  20200  106000       0      0.0   243080
## X21245           1   6300   20000       0   1000.0    25301
## X1203599      1170   8900   28000       0 108800.0    26270
## X18640        3150   4500    9000       0  10260.0     6390
## X3875100     57000  10000       0       0  27560.0    39440
## X7570         3650   9100       0       0   9200.0     3550
## X16115      359100  14100  107000       0  50000.0   473200
## X6355            0   6600       0       0   3300.0     3300
## X17615         430  11500   31000       0 117450.0    34480
## X6920            0      0       0       0      0.0        0
## X8960         7370      0       0       0   2500.0    59870
## X2325101    214900  23700  215000       0 151600.0   527000
## X255         43000   8800  345000       0  80000.0   424800
## X19515       58110  19000       0       0   9900.0    74210
## X19205102    57550  13000  136000       0  60200.0   200350
## X15825        1000  14900   60000       0 115000.0   115900
## X9090        19080  26200   68000       0  32200.0   113080
## X4540         9400      0   64000       0      0.0    73400
## X15225     2061430  85000 1130000       0 677900.0  3268530
## X10300         120      0       0       0      0.0      120
## X21650        2300  11400   13332       0  83930.5   103100
## X3780           10      0       0       0      0.0       10
## X13835        2500  27000   20000       0  72300.0    27200
## X5100       139000  13000  273000       0  99900.0   422100
## X11025       22700  24000   40000       0  96000.0    70700
## X13740       37700   9000   19000       0  41000.0    65700
## X17905        7060   9800    3000       0 142140.0   -35280
## X925          2500  13000       0       0      0.0    15500
## X11925         510      0  164900       0   5350.0   165160
## X5210103     27300  14000   40000       0 136000.0    75300
## X14005           0      0       0       0    890.0     -890
## X17815       18600      0   65000       0      0.0    83600
## X14200       23500  14200   89000       0   8500.0   118200
## X2855104         0      0       0       0      0.0        0
## X310105       2500   2000       0       0    600.0     3900
## X15355      108200  16300   10000       0      0.0   134500
## X15135       23800  20900   32000       0  43800.0    71400
## X9020        19950  12000   80000       0 129900.0   102050
## X18630       32820  13000   34000       0 150800.0    50020
## X17315        2505  12500       0       0   4430.0    10575
## X19685      111300  13000   -2000       0  24900.0   112400
## X7100          620  11700    9200       0   4550.0    17770
## X12945106     4000   9400       0       0   1500.0    11900
## X7655       468000      0  353000       0 364000.0  8743300
## X20875        5450      0       0       0      0.0     5450
## X9300         1910   6700   22000   75000  63880.0   101730
## X16905107   680430   9900  200000       0      0.0  1020330
## X155          7100  13900       0       0 104400.0     7640
## X15280        1700      0       0       0    250.0     1450
## X17385        1200      0       0       0      0.0     1200
## X12880        6210  12600       0       0  23780.0    -4970
## X1595       310090  33100  113000       0 102000.0   456190
## X5720         5200      0   26000       0  60100.0    30100
## X17375      339500  28900  170000       0  70000.0   538400
## X11495108   182500  17100   47000       0  97100.0   222500
## X7680109     12700   4200    8000       0  92000.0    24900
## X2590        10170   9900       0    5000  13780.0    11290
## X7200         9700   2400   76000       0    410.0    87690
## X1575        81610  30000   23000       0  88280.0   108330
## X12065           0      0       0       0      0.0        0
## X9715        86500  29000   25000       0 158650.0   136850
## X5065            0      0       0       0    340.0     -340
## X9520         3300      0       0       0   4000.0     -700
## X20565110   384360  30900  148000       0 187000.0   478260
## X2100111     20480   5300       0       0      0.0    25780
## X5175112       360   8300       0       0   2200.0     6460
## X15880       21300  33800  121000       0 161300.0   143800
## X615          1000   7500       0       0   4400.0     4100
## X19490      201000  27900   47000       0 184900.0   244000
## X13850        2400      0       0       0   1100.0     1300
## X14070       21920   5800   78000       0   1600.0   104120
## X16555113    13000   1800   15000       0      0.0    29800
## X21750        1500  12000   40000       0  57000.0    46500
## X1305       105300  21000   75900       0   4100.0   202200
## X5210114     27300  14000   40000       0 136000.0    75300
## X14400115        0      0   50000       0      0.0    50000
## X4120       125500   6700  983400       0      0.0  1819700
## X13600       57500  12000  147000       0 129000.0   205500
## X1670          800   7800   63000       0 157000.0    51600
## X8790       111900  30000  159000       0 133000.0   278900
## X8150        87750   3800   45000       0  51300.0   120250
## X14155       42500  16700  150000       0      0.0   439200
## X17905116     7060   9800    3000       0 142140.0   -35280
## X16735       34700  11000       0       0   3000.0    83700
## X21095117   386100  24300   70000       0 142560.0   454840
## X10280       12700   7000    6200   10000   2100.0    33800
## X8695         5770   6900       0       0   2160.0    10510
## X15485       12000      0   60000       0 195000.0    67000
## X920118      93000   7700   59000       0  66000.0   159700
## X525         10300  12000   50000       0      0.0    72300
## X10740119        0      0  102000       0  28600.0   101400
## X8885          490  13000       0       0  12800.0      690
## X20200        2300   6500       0       0  38200.0   -29400
## X2295          720   6200  112000       0  39000.0   112920
## X14855         800      0       0       0      0.0      800
## X20390      105000   1600   60000       0 160000.0   166600
## X13895       12340  26000   43000       0  68940.0    62400
## X12335      271000  17500  200000       0   2900.0   485600
## X11880       62000  23700   65000       0   1500.0   149200
## X3750        17500   3800       0       0   3000.0    24300
## X16305120       60   2600       0       0   5300.0    -2640
## X11875       42100   1900   50000       0  26300.0    67700
## X7670121     12200  34000   22000       0  60600.0    45600
## X6130            0    950       0       0      0.0      950
## X8050          150  22900   21000       0  55600.0    17450
## X17970           0      0       0       0      0.0        0
## X18735          50   4100       0       0      0.0     4150
## X6920122         0      0       0       0      0.0        0
## X235         20000  12700       0       0  63740.0   -31040
## X5530       231500  20500       0       0  16000.0   236000
## X18720       15400  12120       0       0      0.0    97520
## X2330123    109400  14000   95000       0 100110.0   288290
## X335         11800   5000       0       0   2350.0    14450
## X19405        4840  23500   15000       0 146950.0    36390
## X2615         4100  13400   63000       0  76900.0    78600
## X8060       437200  36100  400000       0      0.0   873300
## X7990          630   5000       0       0   8300.0    -2670
## X17205        1600      0   17000       0 133480.0    18120
## X110         25800      0   11000       0  58000.0    34800
## X16470124   400060  26000  380000       0      0.0  2856060
## X8690        36000  14100   34000       0  85960.0    80140
## X3350       198400  23000   80000   30000  40000.0   331400
## X2440        11070  17000       0       0  23260.0     4810
## X9570          100   3300       0       0      0.0     3400
## X19650       79650   7700   23000       0 113200.0   109150
## X12680125     6150  11700   84000       0  36000.0    71850
## X6175         4800   9800   80000       0    220.0    94380
## X2860126      4830   2800       0       0    650.0     6980
## X21470         600   1900       0       0   2000.0      500
## X9360        13400  10000       0       0  11400.0    12000
## X3235127      6200  20100   70000       0  24670.0    91630
## X10540     1370300  20800  122000       0      0.0  1523100
## X18595      293500  37800  202000       0 218000.0   819300
## X13935     1038000  24800  370000   15000  30000.0  1447800
## X16950      263700  17700  115000       0   4400.0   392000
## X9715128     86500  29000   25000       0 158650.0   136850
## X11980         200   5800    3000       0  17000.0     9000
## X2345        79020  13000   25000   10000  70700.0   117320
## X21130129      310   4400   83000       0      0.0    87710
## X12600        3100   4200       0       0      0.0     7300
## X3470130     19800  18900   75000       0  13000.0   100700
## X9425        19800   2500       0       0      0.0    22300
## X21625        1000  21000       0       0  13000.0     9000
## X13110       32660      0   46000       0  61900.0   689760
## X10765      115770   9400   85000    8000  60000.0   218170
## X10290      163300  12000   72000       0 116000.0   262300
## X20650      276200   1800  400000       0      0.0   678000
## X20680       87500      0       0       0  40110.0    47390
## X20325      130300  18800   17000       0  58260.0   160840
## X15740         750   1900       0       0      0.0     2650
## X10040131        0      0   88000       0      0.0    88000
## X2085          780  17600       0       0      0.0    18380
## X18375132     3600  11000       0       0  11220.0     3380
## X15970     1273000  19000  300000       0      0.0  1592000
## X15490       65850  10850   68568       0      0.0   157900
## X9805          200      0   36000       0   8000.0    28200
## X19805       12070  15700       0    1000  26950.0     1820
## X6710133    125000  65200   60000       0 115700.0   194500
## X20265134    19220   4300       0       0      0.0    23520
## X16850          20   4100       0       0 100100.0   -95980
## X10875135   102500  38000  100000       0  78150.0   222350
## X10560136      660   3700       0       0      0.0     4360
## X19625137    38700  37200  124000       0  21000.0   199900
## X13545        5300   1900   70000       0  15000.0    77200
## X13725138    24220  10000   70000       0  21000.0   103220
## X13385       10000  20000   65000       0 147000.0   474000
## X16935139     9200      0       0    2500      0.0    11700
## X4930        63000      0   65000       0      0.0   308000
## X20930       50830  21200   41000       0  28960.0   103070
## X17100       29300  46400  179000   35000  15000.0   354700
## X18795       42200  14300   57000       0   3000.0   145500
## X1315         1550      0       0       0    170.0     1380
## X3990140         0      0       0       0      0.0        0
## X6590141    134930  26900   55000       0 141070.0   195760
## X10940142   293900   9200  250000       0      0.0   553100
## X17560143    17700  24000   69000       0 101930.0    99770
## X300         16260   7500       0       0   2000.0    21760
## X11475        9610  26600       0       0  54880.0   -18670
## X15370        6640   1700       0       0   1800.0     6540
## X12230         130   2000       0       0      0.0     2130
## X6570          400   9200   50000       0  27550.0   157050
## X13610          90  29600    9200       0  19760.0    29930
## X1940        14500   4000       0       0  19800.0    -1300

Next, assign to income the column INCOME in the cfb data frame, and determine the mean and median income values.

income <- cfb$INCOME
mean(income)
## [1] 63402.66
median(income)
## [1] 38032.7

The first output is the mean income and the second is the median income. Mean income is greater than median income. This indicates there are more small income values than large income values, but some of the large income values are very large.

This ‘skewness’ in the distribution of values can be seen on a histogram. A histogram is a plot that shows how often the values fall within each of a set of equal-width intervals (bins).

A histogram is made with the hist() function. Here you suggest the number of intervals with the breaks = argument. R treats a single number as a guide and chooses convenient break points, so the number of bars may not match exactly.

hist(income, 
     breaks = 25)

The distribution is said to be right skewed. It has a long right tail.
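
The link between a long right tail and a mean that exceeds the median can be checked on a small made-up vector. This is a minimal sketch; the values in x are hypothetical and are not taken from the cfb data.

```r
# Hypothetical incomes in thousands of dollars (illustration only)
x <- c(20, 25, 30, 35, 40, 500)

mean(x)    # pulled upward by the single very large value
median(x)  # middle of the ordered values, unaffected by the size of the largest value

# TRUE here is consistent with a right-skewed set of values
mean(x) > median(x)
```

Replacing 500 with 45 brings the mean and median close together, which is what you expect when the values are roughly symmetric.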

Note: Some packages come with data sets. To see what data is available in a package, type

data(package = "UsingR")

Spread

A simple measure of the spread of data values is the range. The range is given by the minimum and maximum value or by the difference between them.

range(income)
## [1]       0 1541866
diff(range(income))
## [1] 1541866

Or using the central tendency as the center of a set of values, you can define spread in terms of deviations from the center.

The sum of the squared deviations from the mean divided by the sample size minus one is the sample variance. Its square root is the sample standard deviation.

var(income)
## [1] 13070833215
sqrt(var(income))
## [1] 114327.7
sd(income)
## [1] 114327.7

To illustrate, consider two sets of test scores with similar means but very different spreads.

ts1 <- c(80, 85, 75, 77, 87, 82, 88)
ts2 <- c(100, 90, 50, 57, 82, 100, 86)

Some test score statistics are

mean(ts1)
## [1] 82
mean(ts2)
## [1] 80.71429
var(ts1)
## [1] 24.66667
var(ts2)
## [1] 394.2381
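
The variance of the first set of scores can be reproduced step by step from the definition given earlier (sum of squared deviations from the mean divided by the sample size minus one). This is a sketch for checking your understanding; the object names dev, ss_dev, and v are introduced here only for illustration.

```r
ts1 <- c(80, 85, 75, 77, 87, 82, 88)

dev <- ts1 - mean(ts1)           # deviation of each score from the mean
ss_dev <- sum(dev^2)             # sum of the squared deviations
v <- ss_dev / (length(ts1) - 1)  # divide by sample size minus one
v

# Compare with the built-in functions
all.equal(v, var(ts1))
all.equal(sqrt(v), sd(ts1))
```

Both comparisons return TRUE, so var() and sd() are doing nothing more than this arithmetic.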

Vector types

All the elements of a vector must have the same type. That is, you can’t mix numbers with character strings.

Consider the following character strings.

simpsons <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")
simpsons
## [1] "Homer"  "Marge"  "Bart"   "Lisa"   "Maggie"

Note that character strings are made with matching quotes, either double, ", or single, '.

If you mix element types within a data vector, R converts all elements into the ‘lowest’ common type, meaning the most general type that can represent every element. This is usually character. Arithmetic does not work on character elements.
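
A short sketch of what happens when types are mixed. The object names mixed and num_log are made up for this illustration.

```r
# A number, a character string, and a logical value in one vector
mixed <- c(1, "two", TRUE)
class(mixed)   # everything has been converted to character
mixed

# Numbers mixed with logicals become numeric (TRUE is 1, FALSE is 0)
num_log <- c(1, TRUE, FALSE)
class(num_log)
num_log
```

Note that the number 1 in mixed is now the string "1", so an expression like mixed + 1 produces an error.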

Return to the landfalling hurricane counts.

cD1 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)   
cD2 <- c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)

Now suppose the National Hurricane Center (NHC) reanalyzes a storm, and that the 6th year of the 2nd decade is a 1 rather than a 0 for the number of landfalls. In this case you type

cD2[6] <- 1

The assignment to the 6th element in the vector cD2 is done by referencing the 6th element of the vector with square brackets [].

It’s important to keep this in mind: Parentheses () are used for functions and square brackets [] are used to get values from vectors (and arrays, lists, etc). REPEAT: [] are used to extract or subset values from vectors, data frames, matrices, etc.

Print all the elements of the data vector, then the 2nd element, the 4th element, all but the 4th element, and all the odd-numbered elements.

cD2
##  [1] 0 5 4 2 3 1 3 3 2 1
cD2[2]  
## [1] 5
cD2[4]
## [1] 2
cD2[-4]
## [1] 0 5 4 3 1 3 3 2 1
cD2[c(1, 3, 5, 7, 9)] 
## [1] 0 4 3 3 2

R’s console keeps a history of your commands. The previous commands are accessed using the up and down arrow keys. Repeatedly pushing the up arrow will scroll backward through the history so you can reuse previous commands.

Many times you wish to change only a small part of a previous command, such as when a typo is made. With the arrow keys you access the previous command then edit it as desired.

Structured data

When data follow a pattern, for instance the integers 1 through 100, you can generate them rather than type them. The colon : function is used for creating simple sequences.

1:100
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100
rev(1:100)
##   [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
##  [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
##  [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
##  [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
##  [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
##  [91]  10   9   8   7   6   5   4   3   2   1
100:1
##   [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
##  [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
##  [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
##  [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
##  [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
##  [91]  10   9   8   7   6   5   4   3   2   1

Often you need a sequence defined either by its starting point, ending point, and step size, or by its starting point, ending point, and length. The seq() function handles both cases.

seq(from = 1, to = 9, by = 2)
## [1] 1 3 5 7 9
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
seq(from = 1, to = 9, length.out = 5)
## [1] 1 3 5 7 9

To create a vector with each element having the same value, use the rep() function (replicate). The simplest usage is to replicate the first argument a specified number of times.

rep(1, times = 10)
##  [1] 1 1 1 1 1 1 1 1 1 1
rep(1:3, times = 3)
## [1] 1 2 3 1 2 3 1 2 3

More complicated patterns can be repeated by specifying pairs of equal-sized vectors. In this case, each element of the first vector is repeated the corresponding number of times specified by the element in the second vector.

rep(c("long", "short"), times = c(1, 2))
## [1] "long"  "short" "short"

Asking questions

To find the most landfalls in the first decade, type:

max(cD1)
## [1] 3

Which years had the most?

cD1 == 3
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Notice the double equals signs (==). This tests each value (element) in cD1 to see if it is equal to 3. The 2nd and 4th values are equal to 3 so TRUEs are returned. Think of this as asking R a question. Is the value equal to 3? R answers all at once with a vector of TRUEs and FALSEs.

How do you get the positions of the vector elements corresponding to the TRUE values? That is, which years have 3 landfalls?

which(cD1 == 3)
## [1] 2 4

The function which.max() returns the position of the first maximum only.

which.max(cD1)
## [1] 2
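
Because which.max() stops at the first maximum, a sketch like the following can be used when you want every year that ties for the most landfalls, or the values themselves rather than their positions.

```r
cD1 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)

# Positions of all elements equal to the maximum
which(cD1 == max(cD1))

# A logical vector inside [] keeps the values where the test is TRUE
cD1[cD1 == 3]

# Only the first position is returned here
which.max(cD1)
```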

You might also want to know the total number of landfalls in each decade and the number of years in a decade without a landfall. Or how about the ratio of the mean number of landfalls over the two decades.

sum(cD1)
## [1] 13
sum(cD2)
## [1] 24
sum(cD1 == 0)
## [1] 3
sum(cD2 == 0)
## [1] 1
mean(cD2) / mean(cD1)
## [1] 1.846154

There are 85% more landfalls during the second decade. Is this increase statistically significant?
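
The 85% figure follows from the ratio of the two means. A minimal sketch of the arithmetic, with ratio and pct_increase as names introduced here for illustration:

```r
cD1 <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
cD2 <- c(0, 5, 4, 2, 3, 1, 3, 3, 2, 1)  # includes the reanalyzed 6th year

ratio <- mean(cD2) / mean(cD1)

# A ratio of 1 means no change, so subtract 1 and convert to percent
pct_increase <- (ratio - 1) * 100
round(pct_increase)
```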

To remove an object from the current environment you use the rm() function. This is usually not needed unless you have very large objects (e.g., millions of cases).

rm(cD1, cD2)

Tables and summaries

All elements of a vector must be of the same type. For example, the vectors A, B, and C below are constructed as numeric, logical, and character, respectively.

First create the vectors then check the class.

A <- c(1, 2.2, 3.6, -2.8) 
B <- c(TRUE, TRUE, FALSE, TRUE)
C <- c("Cat 1", "Cat 2", "Cat 3")
class(A)
## [1] "numeric"
class(B)
## [1] "logical"
class(C)
## [1] "character"

With logical and character vectors the table() function counts how many times each distinct value occurs. For instance, let the vector wx denote the weather conditions for five forecast periods as character data.

wx <- c("sunny", "clear", "cloudy", "cloudy", "rain")
class(wx)
## [1] "character"
table(wx)
## wx
##  clear cloudy   rain  sunny 
##      1      2      1      1

The output shows each distinct character string and the number of times it occurs.

As another example, let the vector ss denote the Saffir-Simpson category for a set of five hurricanes.

ss <- c("Cat 3", "Cat 2", "Cat 1", "Cat 3", "Cat 3")
table(ss)
## ss
## Cat 1 Cat 2 Cat 3 
##     1     1     3

Here the character strings correspond to different intensity levels as ordered categories with Cat 1 < Cat 2 < Cat 3. In this case convert the character vector to an ordered factor with levels. This is done with the function factor().

ss <- factor(ss, ordered = TRUE)
class(ss)
## [1] "ordered" "factor"
ss
## [1] Cat 3 Cat 2 Cat 1 Cat 3 Cat 3
## Levels: Cat 1 < Cat 2 < Cat 3

The vector object is now an ordered factor. Printing the object results in a list of the elements in the vector and a list of the levels in order. Note: if you do the same for the wx object, the order is alphabetical by default. Try it.
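
When alphabetical order is not the order you want, you can give the levels explicitly with the levels = argument of factor(). The sketch below assumes, purely for illustration, that the weather conditions should be ordered from fairest to foulest. That ordering is a choice made here and is not part of the data.

```r
wx <- c("sunny", "clear", "cloudy", "cloudy", "rain")

# Default: levels are sorted alphabetically
wx_default <- factor(wx, ordered = TRUE)
levels(wx_default)

# Explicit levels (assumed ordering, for illustration only)
wx_ordered <- factor(wx,
                     levels = c("clear", "sunny", "cloudy", "rain"),
                     ordered = TRUE)
levels(wx_ordered)

# table() now reports the counts in the level order
table(wx_ordered)
```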

Tuesday, September 6, 2022

Today

  • Getting data into R
  • Data frames
  • Quantiles
  • Pipes

More information about how to use RStudio and markdown files is available here: https://www.pipinghotdata.com/posts/2020-09-07-introducing-the-rstudio-ide-and-r-markdown/

Getting your data into R

You need to know two things: (1) where the data are located, and (2) what type of data file it is.

Consider the file US.txt located in your project folder. It is in the same folder as this file (05-Lesson.Rmd). Click on the file name. It opens a file tab that shows a portion of the file.

It is a file with the column headings Year, All, MUS, G, FL, E. Each row is a year and each count is the number of hurricanes making landfall in the United States that year. All is the count anywhere in the continental U.S., MUS is the count at major hurricane intensity (at least 50 m/s), and G, FL, and E are the counts along the Gulf coast, in Florida, and along the East coast. Columns are separated by spaces.

To create a data object you use the readr::read_table() function. The only required argument is file =.

You put the name of the file in quotes. The first row in the file holds the column names rather than data, which is what readr::read_table() assumes by default (col_names = TRUE).

LH.df <- readr::read_table(file = "US.txt")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   All = col_double(),
##   MUS = col_double(),
##   G = col_double(),
##   FL = col_double(),
##   E = col_double()
## )

A data object called LH.df is now in your Environment under Data.

In this case the file name is simple because US.txt is in the same directory as your Rmd file.

Data files for an analysis are often kept somewhere else. Here, for example, note the folder called data. Click on the folder name. To read the data from that location you change the file string to "data/US.txt".

LH.df <- readr::read_table(file = "data/US.txt")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   All = col_double(),
##   MUS = col_double(),
##   G = col_double(),
##   FL = col_double(),
##   E = col_double()
## )

The file = argument is where R looks for your data.

If you get an error message it is likely because the data file is not where you think it is.

Note: No changes are made to your original data file.

If there are missing values in the data file they should be coded as NA. If they are coded as something else then you specify the coding with the na = argument. For example, if the missing value in your file is coded as 99, you specify na = "99".

The readr::read_csv() function has settings that are suitable for comma delimited (csv) files that have been exported from a spreadsheet.

A work flow might include exporting data from a spreadsheet using the csv file format then importing it to R using the readr::read_csv() function.
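
The following is a minimal sketch of that work flow combined with the na = argument. The file contents are made up, and a temporary file stands in for a csv file exported from a spreadsheet. The names csv_path and counts are hypothetical.

```r
# Write a small, made-up csv file in which 99 marks a missing value
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,All,FL",
             "2001,1,0",
             "2002,99,1",
             "2003,2,99"),
           csv_path)

# Import it, telling readr that "99" means missing
counts <- readr::read_csv(csv_path, na = "99")
counts

# Number of missing values in each column
colSums(is.na(counts))
```

The na = setting applies to every column, so choose a missing-value code that cannot occur as a real value. A year or a genuine count of 99 would also be turned into NA.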

You import data from the web by specifying the URL instead of the local file name.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/US.txt"
LH.df <- readr::read_table(file = loc)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   All = col_double(),
##   MUS = col_double(),
##   G = col_double(),
##   FL = col_double(),
##   E = col_double()
## )

Recall that you reference the columns using the $ syntax. For example, type

LH.df$FL
##   [1] 1 2 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 0 2 1 0 1 2 1 0 3 0 2 0 0 0 3 1
##  [38] 2 0 0 0 0 1 2 0 3 1 1 1 0 1 0 1 0 0 2 0 0 1 1 1 0 0 0 1 2 1 0 1 0 1 0 0 2
##  [75] 1 2 0 2 1 0 0 0 2 2 2 1 0 0 1 0 1 2 0 1 2 1 2 2 1 2 0 0 1 0 0 1 0 0 0 1 0
## [112] 0 0 3 1 2 1 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 2 0 1 0 1 0 0 1 0 0 2 0 0 2
## [149] 1 0 0 0 0 4 3 0 0 0 0 0 0 0 0 0 0 1
sum(LH.df$FL)
## [1] 110

The number of years with 0, 1, 2, … Florida hurricanes is obtained by typing

table(LH.df$FL)
## 
##  0  1  2  3  4 
## 93 43 24  5  1

There are 93 years without a FL hurricane, 43 years with one hurricane, 24 years with two hurricanes, and so on.

Creating structured data files

https://environmentalcomputing.net/getting-started-with-r/

Golden rules of data entry.

Convert unstructured data files (e.g., data stored in PDF forms) to structured data. https://www.youtube.com/watch?v=yBkHfIO8YJk

Data frames

The functions readr::read_table() and readr::read_csv() import data into your environment as a data frame. For example, LH.df is a data frame. You can see that the data object is a data frame in your Environment.

A data frame is like a spreadsheet. Values are arranged in rows and columns. Rows are the cases (observations) and columns are the variables.

The dim() function returns the size of the data frame in terms of how many rows (first number) and how many columns (second number).

dim(LH.df)
## [1] 166   6

There are 166 rows and 6 columns in the data frame.

Note the use of inline code. Open with a single back tick (grave accent) followed by the letter r and close with a single back tick. Inline code allows content in your report to be dynamic. There is no need to retype values when the data changes. Open 05-Lesson.html in a browser.

To list the first six lines of the data object, type

head(LH.df)
## # A tibble: 6 × 6
##    Year   All   MUS     G    FL     E
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1851     1     1     0     1     0
## 2  1852     3     1     1     2     0
## 3  1853     0     0     0     0     0
## 4  1854     2     1     1     0     1
## 5  1855     1     1     1     0     0
## 6  1856     2     1     1     1     0

The columns include year, number of hurricanes, number of major hurricanes, number of Gulf coast hurricanes, number of Florida hurricanes, and number of East coast hurricanes in order. Column names are printed as well.

The last six lines of the data frame are listed similarly using the tail() function. The number of lines listed is changed using the argument n =.

tail(LH.df, n = 3)
## # A tibble: 3 × 6
##    Year   All   MUS     G    FL     E
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2014     1     0     0     0     1
## 2  2015     0     0     0     0     0
## 3  2016     2     0     0     1     1

The number of years in the record is assigned to the object nY and the annual average number of hurricanes (rate) is assigned to the object rate.

nY <- length(LH.df$All)
rate <- mean(LH.df$All)

By typing the names of the saved objects, the values are printed.

nY
## [1] 166
rate
## [1] 1.668675

Thus over the 166 years of data the average number of hurricanes per year is 1.67.

If you want to change the name of a column in the data frame, type

names(LH.df)[4] <- "GC"
names(LH.df)
## [1] "Year" "All"  "MUS"  "GC"   "FL"   "E"

This changes the 4th column name from G to GC. Note that this change occurs to the data frame in R and not to your original data file.

You will work almost exclusively with data frames. A data frame has rows and columns.

  • Columns have names
  • Columns are vectors
  • Columns must be of the same length
  • Each column holds a single data type (different columns can hold different types)

Each element is indexed by a row number and a column number in that order and separated by a comma. So if df is a data frame then df[2, 3] is the second row of the third column.

To print the second row of the first column of the data frame LH.df you type

LH.df[2, 1]
## # A tibble: 1 × 1
##    Year
##   <dbl>
## 1  1852

If you want all the values in a column, you leave the row number blank.

LH.df[ , 1]
## # A tibble: 166 × 1
##     Year
##    <dbl>
##  1  1851
##  2  1852
##  3  1853
##  4  1854
##  5  1855
##  6  1856
##  7  1857
##  8  1858
##  9  1859
## 10  1860
## # … with 156 more rows

You can also reference the column by name LH.df$Year.
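
The indexing rules can be tried on a small data frame. This is a sketch using a base R data frame named df with made-up values. One difference worth knowing: a base data frame returns a plain vector when you select a single column with [ , 1], whereas a tibble (what the readr functions create) returns a one-column tibble, as in the output above.

```r
# A small made-up data frame (values for illustration only)
df <- data.frame(Year = c(1851, 1852, 1853),
                 All  = c(1, 3, 0),
                 FL   = c(1, 2, 0))

df[2, 3]      # second row of the third column
df[, 1]       # all rows of the first column (a vector for a base data frame)
df$Year       # the same column referenced by name
df[["Year"]]  # another way to extract a column as a vector
```

The $ and [[ ]] forms return a vector for both base data frames and tibbles, so they are a safe choice when a function such as mean() needs a vector.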

Data frames have two indexes indicating the rows and columns in that order.

LH.df[10, 4]
## # A tibble: 1 × 1
##      GC
##   <dbl>
## 1     3

  • To a statistician a data frame is a table of observations. Each row contains one observation. Each observation must contain the same variables. These variables are called columns, and you can refer to them by name. You can also refer to the contents of the data frame by row number and column number (like a matrix).

  • To an Excel user a data frame is a worksheet (or a range within a worksheet). A data frame is more restrictive in that each column can only be of one data type (e.g., character, numeric, etc).

As an example, consider monthly precipitation from the state of Florida. Source: Monthly climate series. http://www.esrl.noaa.gov/psd/data/timeseries/. Get monthly precipitation values for the state back to the year 1895. Copy/paste into a text editor (notepad) then import using the readr::read_table() function.

Here I did it for Florida and put the file on my website. Missing values are coded as -9.900 so you add the argument na = "-9.900" to the function.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt"
FLp.df <- readr::read_table(loc, na = "-9.900")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Jan = col_double(),
##   Feb = col_double(),
##   Mar = col_double(),
##   Apr = col_double(),
##   May = col_double(),
##   Jun = col_double(),
##   Jul = col_double(),
##   Aug = col_double(),
##   Sep = col_double(),
##   Oct = col_double(),
##   Nov = col_double(),
##   Dec = col_double()
## )
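
To see what the missing-value argument does without downloading anything, you can read a few lines of text directly. This is only a sketch: the values below are made up for illustration (they are not real Florida data), and it uses the base R function read.table(), where the corresponding argument is named na.strings.

```r
# Made-up values for illustration only; these are not real Florida data
txt <- "Year Jan Feb
1895 3.210 -9.900
1896 -9.900 2.145"

toy.df <- read.table(text = txt,
                     header = TRUE,
                     na.strings = "-9.900")

toy.df
is.na(toy.df$Jan)   # TRUE marks the value that was coded as -9.900
```

Because the code -9.900 is converted to NA on import, the columns stay numeric and the missing values cannot be mistaken for real measurements.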

Plot a time series graph.

library(ggplot2)

ggplot(data = FLp.df, aes(x = Year, y = Jan)) +
  geom_line() +
  ylab("Inches") +
  ggtitle(label = "January Precipitation in Florida",
          subtitle = "1895-2011")

A minimal, complete, reproducible example.

Quantiles

The median value cuts a set of ordered data values into two equal parts: values larger than the median and values less than the median. The ordering comes from arranging the data from lowest to highest.

Quantiles cut a set of ordered data into an arbitrary number of equal-sized parts. The quantile corresponding to cutting the data into two halves is called the median. Fifty percent of the data have values less than or equal to the median value. The median is the 50th percentile (.5 quantile).

Quantiles corresponding to cutting the ordered data into quarters are called quartiles. The lower (first) quartile cuts the data into the lower 25% and upper 75% of the data. The lower quartile is the .25 quantile or the 25th percentile, indicating that 25% of the data have values less than this quantile value.

Correspondingly, the upper (third) quartile is the .75 quantile (75th percentile), indicating that 75% of the data have values less than this quantile value.
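
These definitions can be checked numerically. The sketch below assumes the Temp column of the built-in airquality data frame as a convenient vector with no missing values; the object names temp and q are chosen only for illustration.

```r
temp <- datasets::airquality$Temp   # daily temperatures (degrees F), no missing values

# Lower quartile, median, and upper quartile
q <- quantile(temp, probs = c(.25, .5, .75))
q

# Proportion of the values at or below each quartile
sapply(q, function(v) mean(temp <= v))
```

The proportions come out close to, but not exactly, .25, .5, and .75 because the data contain tied values and the sample size is finite.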

The quantile() function calculates quantiles on a vector of data. For example, consider Florida precipitation for the month of June. First apply the sort() function on the June values (column indicated by the label Jun).

sort(FLp.df$Jun)
##   [1]  2.303  2.445  3.292  3.643  3.673  3.898  3.908  4.089  4.202  4.401
##  [11]  4.500  4.598  4.739  4.747  4.820  4.838  4.965  5.098  5.099  5.160
##  [21]  5.182  5.221  5.321  5.349  5.362  5.422  5.440  5.531  5.588  5.602
##  [31]  5.607  5.614  5.696  5.718  5.724  5.752  5.803  5.866  5.887  5.896
##  [41]  5.931  5.971  5.998  6.142  6.147  6.171  6.220  6.258  6.269  6.281
##  [51]  6.351  6.392  6.392  6.470  6.540  6.541  6.591  6.739  6.789  6.900
##  [61]  6.991  6.998  7.002  7.009  7.012  7.049  7.057  7.098  7.118  7.208
##  [71]  7.306  7.348  7.450  7.451  7.481  7.666  7.707  7.748  7.876  8.000
##  [81]  8.040  8.158  8.168  8.243  8.317  8.378  8.389  8.432  8.488  8.578
##  [91]  8.663  8.874  8.880  8.940  8.969  8.976  9.106  9.308  9.349  9.481
## [101]  9.734  9.865  9.939  9.993 10.032 10.276 10.280 10.288 10.309 10.360
## [111] 10.529 10.858 11.014 11.228 11.824 12.034 12.379

Again, note the use of the dollar sign to indicate the column in the data frame.

To find the 50th percentile you use the median() function directly or the quantile() function and specify the quantile with the probs = argument.

median(FLp.df$Jun)
## [1] 6.789
quantile(FLp.df$Jun,
         probs = .5)
##   50% 
## 6.789

To retrieve the 25th and 75th percentile values

quantile(FLp.df$Jun, 
         probs = c(.25, .75))
##   25%   75% 
## 5.602 8.432

Of the 117 June precipitation values, 25% are less than 5.6 inches and 50% are less than 6.79 inches.

Thus about as many years have June precipitation between 5.6 and 6.79 inches as have June precipitation below 5.6 inches.

The difference between the first and third quartile values is called the interquartile range (IQR). Fifty percent of the data values lie within the IQR. The IQR is obtained using the IQR() function.
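
As a quick check, the IQR can be computed both ways. This sketch assumes the Wind column of the built-in airquality data frame (no missing values) rather than the Florida data, so that it runs without a download.

```r
wind <- datasets::airquality$Wind   # wind speed (mph), no missing values

q <- quantile(wind, probs = c(.25, .75))
unname(q[2] - q[1])   # third quartile minus first quartile
IQR(wind)             # the same value computed directly

# Fraction of the values lying between the two quartiles (inclusive)
mean(wind >= q[1] & wind <= q[2])
```

The last line returns a value near one half; ties at the quartile values can push it slightly above.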

Another example: Consider the set of North Atlantic Oscillation (NAO) index values for the month of June from the period 1851–2010. The NAO is a variation in the climate over the North Atlantic Ocean featuring fluctuations in the difference of atmospheric pressure at sea level between Iceland and the Azores.

The index is computed as the difference in standardized sea-level pressures. The standardization is done by subtracting the mean and dividing by the standard deviation. The index has units of standard deviation.
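
The standardization step can be sketched in a few lines. The example assumes the Temp column of the built-in airquality data frame as a stand-in variable (it is not pressure data), and it standardizes a single series only; the NAO index itself is a difference of two such standardized series.

```r
x <- datasets::airquality$Temp   # stand-in variable; not sea-level pressure

# Subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)   # essentially zero (up to floating-point rounding)
sd(z)     # one, so z is expressed in units of standard deviation
```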

First read the data consisting of monthly NAO values, then list the column names and the first few data lines.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/NAO.txt"
NAO.df <- read.table(loc, 
                     header = TRUE)
head(NAO.df)
##   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1 1851  3.29  1.03  1.50 -1.66 -1.53 -1.62 -5.39  4.68  1.85  0.78 -1.77  1.74
## 2 1852  1.46  0.41 -2.50 -1.60  0.25  0.09 -1.13  2.94 -2.02 -1.65 -0.93  1.03
## 3 1853  1.31 -4.04 -0.32  0.76 -3.17  1.09  1.76 -2.36 -0.22 -0.47  0.51 -4.28
## 4 1854  1.28  1.72  2.67  0.88  0.04 -0.06 -1.92 -0.03  2.62  1.11 -1.56  2.42
## 5 1855 -1.84 -3.80 -0.05  0.99 -2.28  0.78 -2.61  3.81  0.79 -1.09 -2.42 -1.66
## 6 1856 -1.25 -0.10 -2.27  2.00 -0.70  2.03 -0.16 -0.44 -0.50  1.12 -1.69 -0.23

Determine the 5th and 95th percentile values for the month of June.

quantile(NAO.df$Jun, 
         probs = c(.05, .95))
##     5%    95% 
## -2.808  1.891

The summary() function provides summary statistics for each column in your data frame. The statistics include the mean, median, minimum, and maximum, along with the first and third quartile values.

summary(FLp.df)
##       Year           Jan             Feb             Mar             Apr       
##  Min.   :1895   Min.   :0.340   Min.   :0.288   Min.   :0.496   Min.   :0.408  
##  1st Qu.:1924   1st Qu.:1.798   1st Qu.:2.009   1st Qu.:2.142   1st Qu.:1.659  
##  Median :1953   Median :2.696   Median :3.099   Median :3.349   Median :2.677  
##  Mean   :1953   Mean   :2.916   Mean   :3.164   Mean   :3.663   Mean   :2.926  
##  3rd Qu.:1982   3rd Qu.:4.010   3rd Qu.:4.171   3rd Qu.:5.097   3rd Qu.:4.163  
##  Max.   :2011   Max.   :8.361   Max.   :8.577   Max.   :8.701   Max.   :7.457  
##       May             Jun              Jul              Aug        
##  Min.   :0.900   Min.   : 2.303   Min.   : 4.050   Min.   : 4.053  
##  1st Qu.:2.483   1st Qu.: 5.602   1st Qu.: 6.427   1st Qu.: 6.164  
##  Median :3.758   Median : 6.789   Median : 7.522   Median : 7.102  
##  Mean   :3.845   Mean   : 7.046   Mean   : 7.505   Mean   : 7.345  
##  3rd Qu.:4.765   3rd Qu.: 8.432   3rd Qu.: 8.358   3rd Qu.: 8.310  
##  Max.   :9.848   Max.   :12.379   Max.   :11.263   Max.   :13.090  
##       Sep              Oct             Nov             Dec       
##  Min.   : 2.126   Min.   :0.471   Min.   :0.370   Min.   :0.610  
##  1st Qu.: 4.930   1st Qu.:2.479   1st Qu.:1.370   1st Qu.:1.549  
##  Median : 6.680   Median :3.541   Median :2.139   Median :2.558  
##  Mean   : 6.704   Mean   :3.803   Mean   :2.308   Mean   :2.718  
##  3rd Qu.: 7.955   3rd Qu.:4.899   3rd Qu.:3.110   3rd Qu.:3.521  
##  Max.   :12.978   Max.   :9.556   Max.   :6.236   Max.   :7.668

For columns with missing values, the summary() function adds a row giving the number of missing values (NA’s).

Creating a data frame

The data.frame() function creates a data frame from a set of vectors.

Consider ice volume (10\(^3\) km\(^3\)) measurements from the Arctic from 2002 to 2012. The measurements are taken on January 1st each year and are available from http://psc.apl.washington.edu/wordpress/research/projects/arctic-sea-ice-volume-anomaly/data/

Volume <- c(20.233, 19.659, 18.597, 18.948, 17.820, 
           16.736, 16.648, 17.068, 15.916, 14.455, 
           14.569)

Since the data have a sequential order you create a data frame with year in the first column and volume in the second.

Year <- 2002:2012
Ice.df <- data.frame(Year, Volume)
head(Ice.df)
##   Year Volume
## 1 2002 20.233
## 2 2003 19.659
## 3 2004 18.597
## 4 2005 18.948
## 5 2006 17.820
## 6 2007 16.736

What year had the minimum ice volume?

which.min(Ice.df$Volume)
## [1] 10
Ice.df[10, ]
##    Year Volume
## 10 2011 14.455
Ice.df$Year[which.min(Ice.df$Volume)]
## [1] 2011

To change a vector to a data frame use the function as.data.frame(). For example, let counts be a vector of integers.

counts <- rpois(n = 100, 
                lambda = 1.66)
head(counts)
## [1] 1 2 2 3 0 3
H.df <- as.data.frame(counts)
head(H.df)
##   counts
## 1      1
## 2      2
## 3      2
## 4      3
## 5      0
## 6      3

The column name in the data frame is the name of the vector.
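
Because rpois() draws random numbers, your counts will differ from those shown above each time you run the code. The sketch below fixes the random seed so the draws repeat; the seed value 3042 is arbitrary and is not the one used to produce the output above.

```r
set.seed(3042)   # arbitrary seed, chosen only for this sketch
counts <- rpois(n = 100,
                lambda = 1.66)

H.df <- as.data.frame(counts)
names(H.df)      # the column takes the name of the vector

# Resetting the seed reproduces exactly the same draws
set.seed(3042)
identical(counts, rpois(n = 100, lambda = 1.66))
```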

Pipes

So far you have computed statistics on data stored as vectors (mean, median, quantiles, etc.). But you often import data as data frames so you need to know how to manipulate them.

The {dplyr} package has functions (‘verbs’) that manipulate data frames in a friendly and logical way. Manipulations include selecting columns, filtering rows, re-ordering rows, adding new columns, and summarizing data.

library(dplyr)

Let’s look at these using the airquality data frame. Recall the object airquality is a data frame containing New York air quality measurements from May to September 1973. (?airquality).

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
dim(airquality)
## [1] 153   6

The columns include Ozone (ozone concentration in ppb), Solar.R (solar radiation in langleys), Wind (wind speed in mph), Temp (air temperature in degrees F), Month, and Day.

You summarize the values in each column with the summary() method.

summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Note that columns that have missing values are tabulated. For example, there are 37 missing ozone measurements and 7 missing radiation measurements.

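Importantly, you can apply the summary() function using the pipe operator (|> or %>%). The native pipe |> is part of base R (version 4.1 and later); the %>% pipe comes from the {magrittr} package and is made available when you load {dplyr}.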

airquality |> 
  summary()
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

You read the pipe as THEN. “airquality data frame THEN summarize.”

The pipe operator allows you to string together a sequence of functions in a way that makes it easy for humans to understand what was done. This is a key point. You want your code to be readable by a computer (correct syntax) but also readable by other humans.

For example, suppose the object of interest is called me and suppose there is a function called wake_up(). I could apply the function in two ways.

wake_up(me)
me |> 
  wake_up()

The second way involves a bit more typing but it is easier for a human to read and thus it is easier to understand. This becomes clear when stringing together many functions.

For example, what happens to the result of me after the function wake_up() has been applied? How about get_out_of_bed() and then get_dressed()? Again, I can apply these functions in two ways.

get_dressed(get_out_of_bed(wake_up(me)))

me |>
  wake_up() |>
  get_out_of_bed() |>
  get_dressed()

Continuing

me |>
  wake_up() |>
  get_out_of_bed() |>
  get_dressed() |>
  make_coffee() |>
  drink_coffee() |>
  leave_house()

This is much better in terms of ‘readability’ than leave_house(drink_coffee(make_coffee(get_dressed(get_out_of_bed(wake_up(me)))))).

Consider again the FLp.df. How would you use the above syntax to compute the mean value of June precipitation?

You ask three questions: what function, applied to what variable, from what data frame? Answers: mean(), Jun, FLp.df. You then write the code starting with the answer to the last question first.

FLp.df |>
  pull(Jun)
##   [1]  4.500 11.228  5.221  3.292  5.803  9.993 10.360  6.220  7.012  6.591
##  [11]  5.160  8.040  6.392  6.351  6.739 10.288  4.820 12.379  5.531  4.202
##  [21]  5.321  6.541  5.362  5.349  7.481  6.258  3.673  6.540  9.308  6.470
##  [31]  6.281  8.168  7.450  7.057  8.158 10.858  2.303  8.378  5.182  9.865
##  [41]  5.099  8.940  5.931  6.998  9.734  7.049  7.707 10.529  7.348  5.607
##  [51]  8.578  7.098  9.106  3.908  8.000  4.089  4.747  3.643  7.876  5.588
##  [61]  6.392  5.422  7.748  6.147  8.389  6.789  5.896  8.317  7.118  5.614
##  [71] 10.032  8.880  8.488  9.939  6.142  5.866  5.602  8.432  5.887 10.276
##  [81]  6.269  7.002  4.401  6.900  3.898  4.838  5.718 10.280  8.969  5.098
##  [91]  7.009  7.451  5.696  4.739  8.976  5.724  7.666 12.034  4.598  9.349
## [101]  8.874  7.306  7.208  2.445  9.481  5.971  8.663 10.309 11.014  8.243
## [111] 11.824  5.752  5.998  6.991  6.171  5.440  4.965

The function pull() from the {dplyr} package pulls out the column named Jun as a vector.

Then the mean() function takes these 117 values and computes the average.

FLp.df |>
  pull(Jun) |>
  mean()
## [1] 7.045692

Note that the next function in the sequence receives the output from the previous function as its FIRST argument so the function mean() has nothing inside the parentheses.
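
The first-argument rule can be verified by writing the same computation as a nested call and as a piped call. This is a minimal sketch assuming the Temp column of the built-in airquality data frame and R version 4.1 or later (needed for the native pipe); the object names a and b are for illustration only.

```r
temp <- datasets::airquality$Temp

# Nested call: read from the inside out
a <- round(sd(temp), digits = 1)

# Piped call: each result becomes the FIRST argument of the next function;
# additional arguments such as digits = 1 still go inside the parentheses
b <- temp |>
  sd() |>
  round(digits = 1)

identical(a, b)
```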

Your turn

  1. Use the piping operator and compute the average wind speed in the airquality data frame.
  2. Use the piping operator and compute the 10th and 90th percentiles (lower and upper decile values) of May precipitation in Florida.

Thursday, September 8, 2022

Today

  • Pipe operator
  • Wrangling data

Data wrangling (munging) is the process of transforming data from one format into another to make the data easier to interpret.

The {dplyr} package includes functions that wrangle data frames in a logical way. Key idea: The functions operate on data frames and return data frames.

Operations include selecting columns, filtering rows, re-ordering rows, adding new columns, and summarizing data.

library(dplyr)

Recall the object airquality is a data frame containing New York air quality measurements from May to September 1973. (?airquality).

You get a statistical summary of the values in each column with the summary() method.

summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Pipe operator

Importantly, you can apply the summary() function using the pipe operator (|>). The |> pipe is part of base R (version 4.1 and later), and when used together with the wrangling functions from {dplyr}, it provides an easy way to make code readable.

For example, you read the pipe as THEN. “airquality data frame THEN summarize.”

airquality |> 
  summary()
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

The pipe operator allows us to string together functions while keeping the code readable. You want your code to be machine readable (correct syntax) but also human readable.

For example, suppose the object of interest is called me and suppose there is a function called wake_up(). I can apply the function in two ways.

wake_up(me)
me |> 
  wake_up()

The second way involves a bit more typing but it is easier for someone to read and thus it is easier to understand. This becomes clear when stringing together many functions.

For example, what happens to the result of me after the function wake_up() has been applied? How about get_out_of_bed() and then get_dressed()? I can apply these functions in two ways.

get_dressed(get_out_of_bed(wake_up(me)))

me |>
  wake_up() |>
  get_out_of_bed() |>
  get_dressed()

Continuing

me |>
  wake_up() |>
  get_out_of_bed() |>
  get_dressed() |>
  make_coffee() |>
  drink_coffee() |>
  leave_house()

This is much better in terms of ‘readability’ than leave_house(drink_coffee(make_coffee(get_dressed(get_out_of_bed(wake_up(me)))))).

Consider again the FLp.df.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt"
FLp.df <- read.table(loc, 
                     header = TRUE,
                     na.strings = "-9.900")

How would you use the above readable syntax to compute the mean value of June precipitation?

You ask three questions: what function, applied to what variable, from what data frame? Answers: mean(), Jun, FLp.df. You then write the code starting with the answer to the last question first.

FLp.df |>
  pull(Jun)
##   [1]  4.500 11.228  5.221  3.292  5.803  9.993 10.360  6.220  7.012  6.591
##  [11]  5.160  8.040  6.392  6.351  6.739 10.288  4.820 12.379  5.531  4.202
##  [21]  5.321  6.541  5.362  5.349  7.481  6.258  3.673  6.540  9.308  6.470
##  [31]  6.281  8.168  7.450  7.057  8.158 10.858  2.303  8.378  5.182  9.865
##  [41]  5.099  8.940  5.931  6.998  9.734  7.049  7.707 10.529  7.348  5.607
##  [51]  8.578  7.098  9.106  3.908  8.000  4.089  4.747  3.643  7.876  5.588
##  [61]  6.392  5.422  7.748  6.147  8.389  6.789  5.896  8.317  7.118  5.614
##  [71] 10.032  8.880  8.488  9.939  6.142  5.866  5.602  8.432  5.887 10.276
##  [81]  6.269  7.002  4.401  6.900  3.898  4.838  5.718 10.280  8.969  5.098
##  [91]  7.009  7.451  5.696  4.739  8.976  5.724  7.666 12.034  4.598  9.349
## [101]  8.874  7.306  7.208  2.445  9.481  5.971  8.663 10.309 11.014  8.243
## [111] 11.824  5.752  5.998  6.991  6.171  5.440  4.965

The function pull() from the {dplyr} package pulls out the column named Jun and returns a vector of the values.

Then the mean() function takes these 117 values and computes the average.

FLp.df |>
  pull(Jun) |>
  mean()
## [1] 7.045692

IMPORTANT: the next function in the sequence receives the output from the previous function as its FIRST argument so the function mean() has nothing inside the parentheses.

  1. Use the piping operator and compute the average wind speed in the airquality data frame.
airquality |>
  pull(Wind) |>
  mean()
## [1] 9.957516
  2. Use the piping operator and compute the 10th and 90th percentiles (lower and upper decile values) of May precipitation in Florida.
FLp.df |>
  pull(May) |>
  quantile(probs = c(.1, .9))
##    10%    90% 
## 1.7954 6.0828

Wrangling data frames

You will wrangle data with functions from the {dplyr} package. The functions work on data frames but they work better if the data frame is a tibble. Tibbles are data frames that make life a little easier.

R is an old language, and some things that were useful 10 or 20 years ago now get in the way. To make a data frame a tibble (tabular data frame) type

airquality <- as_tibble(airquality)
class(airquality)
## [1] "tbl_df"     "tbl"        "data.frame"

Click on airquality in the environment. It is a data frame.

Selecting and filtering

The function select() chooses variables by name to create a data frame with fewer columns. For example, choose the month, day, and temperature columns from the airquality data frame.

airquality |>
  dplyr::select(Month, Day, Temp)
## # A tibble: 153 × 3
##    Month   Day  Temp
##    <int> <int> <int>
##  1     5     1    67
##  2     5     2    72
##  3     5     3    74
##  4     5     4    62
##  5     5     5    56
##  6     5     6    66
##  7     5     7    65
##  8     5     8    59
##  9     5     9    61
## 10     5    10    69
## # … with 143 more rows

Suppose you want a new data frame with only the temperature and ozone concentrations.

df <- airquality |>
        dplyr::select(Temp, Ozone)
df
## # A tibble: 153 × 2
##     Temp Ozone
##    <int> <int>
##  1    67    41
##  2    72    36
##  3    74    12
##  4    62    18
##  5    56    NA
##  6    66    28
##  7    65    23
##  8    59    19
##  9    61     8
## 10    69    NA
## # … with 143 more rows

You include an assignment operator (<-, left pointing arrow) and an object name (here df).

Note: The result of applying most {dplyr} verbs is a data frame. They take only data frames and return only data frames.

The function filter() chooses observations based on specific values.

Suppose you want only the observations where the temperature is at or above 80F.

airquality |>
  dplyr::filter(Temp >= 80)
## # A tibble: 73 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    45     252  14.9    81     5    29
##  2    NA     186   9.2    84     6     4
##  3    NA     220   8.6    85     6     5
##  4    29     127   9.7    82     6     7
##  5    NA     273   6.9    87     6     8
##  6    71     291  13.8    90     6     9
##  7    39     323  11.5    87     6    10
##  8    NA     259  10.9    93     6    11
##  9    NA     250   9.2    92     6    12
## 10    23     148   8      82     6    13
## # … with 63 more rows

The result is a data frame with the same 6 columns but now only 73 observations. Each of the observations has a temperature of at least 80F.

Suppose you want a new data frame keeping only observations where temperature is at least 80F AND winds less than 5 mph.

df <- airquality |> 
  dplyr::filter(Temp >= 80 & Wind < 5)
df
## # A tibble: 8 × 6
##   Ozone Solar.R  Wind  Temp Month   Day
##   <int>   <int> <dbl> <int> <int> <int>
## 1   135     269   4.1    84     7     1
## 2    64     175   4.6    83     7     5
## 3    66      NA   4.6    87     8     6
## 4   122     255   4      89     8     7
## 5   168     238   3.4    81     8    25
## 6   118     225   2.3    94     8    29
## 7    73     183   2.8    93     9     3
## 8    91     189   4.6    93     9     4
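
Because each verb takes a data frame and returns a data frame, the two verbs can be chained with the pipe. This is a sketch assuming the {dplyr} package is installed; the object name hot.df is chosen only for illustration.

```r
# Filter rows first, then select columns; the output of filter() is a
# data frame, so it can be passed straight to select()
hot.df <- datasets::airquality |>
  dplyr::filter(Temp >= 80) |>        # keep rows at or above 80F
  dplyr::select(Month, Day, Temp)     # then keep three of the columns

is.data.frame(hot.df)
dim(hot.df)
```

The result has the 73 rows found above with Temp >= 80, but only three columns.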

Example: Palmer penguins

Let’s return to the penguins data set. The data set is located on the web, and you import it as a data frame using the readr::read_csv() function.

loc <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins <- readr::read_csv(loc)
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>

To keep only the penguins labeled in the column sex as female type

penguins |> 
  dplyr::filter(sex == "female")
## # A tibble: 165 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.5          17.4               186        3800
##  2 Adelie  Torgersen           40.3          18                 195        3250
##  3 Adelie  Torgersen           36.7          19.3               193        3450
##  4 Adelie  Torgersen           38.9          17.8               181        3625
##  5 Adelie  Torgersen           41.1          17.6               182        3200
##  6 Adelie  Torgersen           36.6          17.8               185        3700
##  7 Adelie  Torgersen           38.7          19                 195        3450
##  8 Adelie  Torgersen           34.4          18.4               184        3325
##  9 Adelie  Biscoe              37.8          18.3               174        3400
## 10 Adelie  Biscoe              35.9          19.2               189        3800
## # … with 155 more rows, and 2 more variables: sex <chr>, year <dbl>

To keep only the rows for species that are not Adelie penguins, type

penguins |> 
  dplyr::filter(species != "Adelie")
## # A tibble: 192 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # … with 182 more rows, and 2 more variables: sex <chr>, year <dbl>

When the column of interest is numeric, you can filter rows using a greater-than condition. For example, to create a data frame containing the heaviest penguins you filter keeping only rows with body mass greater than 6000 g.

penguins |> 
  dplyr::filter(body_mass_g > 6000)
## # A tibble: 2 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
## 2 Gentoo  Biscoe           59.6          17                230        6050 male 
## # … with 1 more variable: year <dbl>

You can also filter rows of a data frame using a less-than condition. For example, to create a data frame containing only penguins with short flippers you filter keeping only rows with flipper length less than 175 mm.

penguins |> 
  dplyr::filter(flipper_length_mm < 175)
## # A tibble: 2 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Biscoe           37.8          18.3              174        3400 fema…
## 2 Adelie  Biscoe           37.9          18.6              172        3150 fema…
## # … with 1 more variable: year <dbl>

You can also specify more than one condition. For example, to create a data frame with female penguins that have long flippers you filter keeping only rows with flipper length greater than 220 mm and with sex equal to female.

penguins |> 
  dplyr::filter(flipper_length_mm > 220 & 
                sex == "female")
## # A tibble: 1 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Gentoo  Biscoe           46.9          14.6              222        4875 fema…
## # … with 1 more variable: year <dbl>

You can also filter a data frame for rows satisfying at least one of two conditions using OR (|). For example, to create a data frame with penguins that have long flippers or shallow bills you filter keeping rows with flipper length greater than 220 mm or with bill depth less than 10 mm.

penguins |> 
  dplyr::filter(flipper_length_mm > 220 | 
                bill_depth_mm < 10)
## # A tibble: 35 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
##  1 Gentoo  Biscoe           50            16.3               230        5700
##  2 Gentoo  Biscoe           49.2          15.2               221        6300
##  3 Gentoo  Biscoe           48.7          15.1               222        5350
##  4 Gentoo  Biscoe           47.3          15.3               222        5250
##  5 Gentoo  Biscoe           59.6          17                 230        6050
##  6 Gentoo  Biscoe           49.6          16                 225        5700
##  7 Gentoo  Biscoe           50.5          15.9               222        5550
##  8 Gentoo  Biscoe           50.5          15.9               225        5400
##  9 Gentoo  Biscoe           50.1          15                 225        5000
## 10 Gentoo  Biscoe           50.4          15.3               224        5550
## # … with 25 more rows, and 2 more variables: sex <chr>, year <dbl>
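
Conditions inside filter() are ordinary logical vectors, and & and | combine them element by element. The sketch below uses base R only and assumes the built-in airquality data frame; the names warm and calm are chosen for illustration.

```r
aq <- datasets::airquality

warm <- aq$Temp >= 80   # TRUE where the temperature is at least 80F
calm <- aq$Wind < 5     # TRUE where the wind speed is below 5 mph

sum(warm)          # number of rows satisfying the first condition
sum(warm & calm)   # rows satisfying both conditions (AND)
sum(warm | calm)   # rows satisfying at least one condition (OR)
```

An AND condition can never keep more rows than either condition alone, and an OR condition can never keep fewer.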

Often you want to remove rows if one of the columns has a missing value. With is.na() on the column of interest, you can filter rows based on whether or not a column value is missing.

Note that the is.na() function returns a vector of TRUEs and FALSEs.

is.na(airquality$Ozone)
##   [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
##  [49] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [73] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

The first four values of the vector Ozone in the airquality data frame are not missing, so the function is.na() returns four FALSEs. The fifth value is missing, so the fifth value returned is TRUE.

When you combine that with the filter() function you get a data frame containing all the rows where is.na() returns a TRUE. For example, create a data frame containing rows where the bill length value is missing.

penguins |> 
  dplyr::filter(is.na(bill_length_mm))
## # A tibble: 2 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…             NA            NA               NA          NA <NA> 
## 2 Gentoo  Biscoe             NA            NA               NA          NA <NA> 
## # … with 1 more variable: year <dbl>

Usually you will want to do the reverse of this. That is, keep all the rows where the column value is not missing. In this case use the negation symbol ! to reverse the selection. In this example, keep the rows with no missing value in the sex column.

penguins |> 
  dplyr::filter(!is.na(sex))
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <chr>, year <dbl>

Note that this filtering keeps rows that have missing values in other columns, but there will be no penguins where the sex value is NA.
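
The difference between dropping rows with a missing value in one column and dropping rows with a missing value in any column can be sketched with the built-in airquality data frame. The use of the base R function complete.cases() here is one option among several, and the object names are illustrative.

```r
# Rows with a non-missing ozone value (other columns may still contain NA).
ozone_ok <- airquality |>
  dplyr::filter(!is.na(Ozone))

# Rows with no missing value in any column.
all_ok <- airquality |>
  dplyr::filter(complete.cases(airquality))

nrow(ozone_ok)
nrow(all_ok)    # never larger than nrow(ozone_ok)
```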

Stringing functions together

The function arrange() orders the rows by values given in a particular column.

airquality |>
  dplyr::arrange(Solar.R)
## # A tibble: 153 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    16       7   6.9    74     7    21
##  2     1       8   9.7    59     5    21
##  3    23      13  12      67     5    28
##  4    23      14   9.2    71     9    22
##  5     8      19  20.1    61     5     9
##  6    14      20  16.6    63     9    25
##  7     9      24  13.8    81     8     2
##  8     9      24  10.9    71     9    14
##  9     4      25   9.7    61     5    23
## 10    13      27  10.3    76     9    18
## # … with 143 more rows

The ordering is from lowest value to highest value. Only the first 10 rows are shown. Note that Month and Day are no longer in chronological order.

Repeat but order by the value of air temperature.

airquality |>
  dplyr::arrange(Temp)
## # A tibble: 153 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    NA      NA  14.3    56     5     5
##  2     6      78  18.4    57     5    18
##  3    NA      66  16.6    57     5    25
##  4    NA      NA   8      57     5    27
##  5    18      65  13.2    58     5    15
##  6    NA     266  14.9    58     5    26
##  7    19      99  13.8    59     5     8
##  8     1       8   9.7    59     5    21
##  9     8      19  20.1    61     5     9
## 10     4      25   9.7    61     5    23
## # … with 143 more rows
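
To reverse the ordering, wrap the column name in desc(). Additional column names break ties. A minimal sketch, with hottest as an illustrative object name:

```r
# Warmest days first; ties in Temp are broken by Wind (calmest first).
hottest <- airquality |>
  dplyr::arrange(dplyr::desc(Temp), Wind)

head(hottest, n = 3)
```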

Importantly, you can string the functions together. For example, select the variables radiation, wind, and temperature, then filter by temperatures above 90F, and arrange from coolest to warmest by temperature.

airquality |>
  dplyr::select(Solar.R, Wind, Temp) |>
  dplyr::filter(Temp > 90) |>
  dplyr::arrange(Temp)
## # A tibble: 14 × 3
##    Solar.R  Wind  Temp
##      <int> <dbl> <int>
##  1     291  14.9    91
##  2     167   6.9    91
##  3     250   9.2    92
##  4     267   6.3    92
##  5     272   5.7    92
##  6     222   8.6    92
##  7     197   5.1    92
##  8     259  10.9    93
##  9     183   2.8    93
## 10     189   4.6    93
## 11     225   2.3    94
## 12     188   6.3    94
## 13     237   6.3    96
## 14     203   9.7    97

The result is a data frame with three columns and 14 rows arranged by increasing temperatures above 90F.

The mutate() function adds new columns to the data frame.

For example, create a new column called TempC as the temperature in degrees Celsius. Also create a column called WindMS as the wind speed in meters per second (the Wind column is in miles per hour).

airquality |>
  dplyr::mutate(TempC = (Temp - 32) * 5/9,
                WindMS = Wind * .44704) 
## # A tibble: 153 × 8
##    Ozone Solar.R  Wind  Temp Month   Day TempC WindMS
##    <int>   <int> <dbl> <int> <int> <int> <dbl>  <dbl>
##  1    41     190   7.4    67     5     1  19.4   3.31
##  2    36     118   8      72     5     2  22.2   3.58
##  3    12     149  12.6    74     5     3  23.3   5.63
##  4    18     313  11.5    62     5     4  16.7   5.14
##  5    NA      NA  14.3    56     5     5  13.3   6.39
##  6    28      NA  14.9    66     5     6  18.9   6.66
##  7    23     299   8.6    65     5     7  18.3   3.84
##  8    19      99  13.8    59     5     8  15     6.17
##  9     8      19  20.1    61     5     9  16.1   8.99
## 10    NA     194   8.6    69     5    10  20.6   3.84
## # … with 143 more rows

The resulting data frame has 8 columns (two new ones) labeled TempC and WindMS.
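
Within a single call to mutate(), a later expression can use a column created by an earlier one. The sketch below illustrates this with a hypothetical TempK column (temperature in kelvin), which is not part of the original example.

```r
converted <- airquality |>
  dplyr::mutate(TempC = (Temp - 32) * 5/9,
                TempK = TempC + 273.15)   # uses the TempC column just created

head(converted, n = 3)
```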

On days when the temperature is below 60 F add a column giving the apparent temperature based on the cooling effect of the wind (wind chill) and then arrange from coldest to warmest apparent temperature.

airquality |>
  dplyr::filter(Temp < 60) |>
  dplyr::mutate(TempAp = 35.74 + .6215 * Temp - 35.75 * Wind^.16 + .4275 * Temp * Wind^.16) |>
  dplyr::arrange(TempAp)
## # A tibble: 8 × 7
##   Ozone Solar.R  Wind  Temp Month   Day TempAp
##   <int>   <int> <dbl> <int> <int> <int>  <dbl>
## 1    NA      NA  14.3    56     5     5   52.5
## 2     6      78  18.4    57     5    18   53.0
## 3    NA      66  16.6    57     5    25   53.3
## 4    NA     266  14.9    58     5    26   54.9
## 5    18      65  13.2    58     5    15   55.2
## 6    NA      NA   8      57     5    27   55.3
## 7    19      99  13.8    59     5     8   56.4
## 8     1       8   9.7    59     5    21   57.3
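
A long formula inside mutate() is easier to read and reuse when it is wrapped in a function. The helper wind_chill() below is hypothetical (it is not part of the course code) and simply repeats the formula used above, with temperature in degrees F and wind speed in miles per hour.

```r
# Hypothetical helper: apparent temperature (degrees F) from air temperature
# (degrees F) and wind speed (mph), using the same formula as above.
wind_chill <- function(temp_f, wind_mph) {
  35.74 + 0.6215 * temp_f - 35.75 * wind_mph^0.16 +
    0.4275 * temp_f * wind_mph^0.16
}

cold_days <- airquality |>
  dplyr::filter(Temp < 60) |>
  dplyr::mutate(TempAp = wind_chill(Temp, Wind)) |>
  dplyr::arrange(TempAp)

cold_days
```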

Summarize

The summarize() function reduces (flattens) the data frame based on a function that computes a statistic. For example, to compute the average wind speed during July type

airquality |>
  dplyr::filter(Month == 7) |>
  dplyr::summarize(Wavg = mean(Wind))
## # A tibble: 1 × 1
##    Wavg
##   <dbl>
## 1  8.94

Similarly, to compute the average temperature during June type

airquality |>
  dplyr::filter(Month == 6) |>
  dplyr::summarize(Tavg = mean(Temp))
## # A tibble: 1 × 1
##    Tavg
##   <dbl>
## 1  79.1

You have seen functions that compute statistics on vectors, including sum(), sd(), min(), max(), var(), range(), and median(). Others include

Summary function       Description
dplyr::n()             Length of the column
dplyr::first()         First value of the column
dplyr::last()          Last value of the column
dplyr::n_distinct()    Number of distinct values
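
A minimal sketch applying these four functions inside summarize(), using the built-in airquality data frame. The column names on the left-hand side are illustrative.

```r
temp_summary <- airquality |>
  dplyr::summarize(NumRows = dplyr::n(),                     # number of rows
                   FirstTemp = dplyr::first(Temp),           # value in the first row
                   LastTemp = dplyr::last(Temp),             # value in the last row
                   NumMonths = dplyr::n_distinct(Month))     # distinct months

temp_summary
```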

Find the maximum and median wind speed and maximum ozone concentration values during the month of May. Also determine the number of observations during May.

airquality |>
  dplyr::filter(Month == 5) |>
  dplyr::summarize(Wmax = max(Wind),
            Wmed = median(Wind),
            OzoneMax = max(Ozone),
            NumDays = dplyr::n())
## # A tibble: 1 × 4
##    Wmax  Wmed OzoneMax NumDays
##   <dbl> <dbl>    <int>   <int>
## 1  20.1  11.5       NA      31

Why do you get an NA for OzoneMax? The Ozone column contains missing values in May, and max() returns NA when any value is missing.

Fix this by including the argument na.rm = TRUE inside the max() function.

airquality |>
  dplyr::filter(Month == 5) |>
  dplyr::summarize(Wmax = max(Wind),
            Wmed = median(Wind),
            OzoneMax = max(Ozone, na.rm = TRUE),
            NumDays = dplyr::n())
## # A tibble: 1 × 4
##    Wmax  Wmed OzoneMax NumDays
##   <dbl> <dbl>    <int>   <int>
## 1  20.1  11.5      115      31
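
Because na.rm = TRUE silently drops the missing values, it can help to report how many were dropped. A sketch, assuming a count of missing values is wanted alongside the maximum (the column name NumMissing is illustrative):

```r
may_ozone <- airquality |>
  dplyr::filter(Month == 5) |>
  dplyr::summarize(NumDays = dplyr::n(),
                   NumMissing = sum(is.na(Ozone)),        # TRUE counts as 1
                   OzoneMax = max(Ozone, na.rm = TRUE))

may_ozone
```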

Grouping

If you want to summarize separately for each month you use the group_by() function. You split the data frame by some variable (e.g., Month), apply a function to the individual data frames, and then combine the output.

Find the highest ozone concentration by month. Include the number of observations (days) in the month.

airquality |>
  dplyr::group_by(Month) |>
  dplyr::summarize(OzoneMax =  max(Ozone, na.rm = TRUE),
            NumDays = dplyr::n())
## # A tibble: 5 × 3
##   Month OzoneMax NumDays
##   <int>    <int>   <int>
## 1     5      115      31
## 2     6       71      30
## 3     7      135      31
## 4     8      168      31
## 5     9       96      30
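
The split, apply, and combine steps that group_by() and summarize() perform can be written out one at a time with base R functions. This is only a sketch to show the idea and is not how {dplyr} is implemented. The object names are illustrative.

```r
# Split: one data frame per month.
by_month <- split(airquality, airquality$Month)

# Apply: highest ozone value within each piece.
ozone_max <- sapply(by_month, function(df) max(df$Ozone, na.rm = TRUE))

# Combine: one row per month.
base_result <- data.frame(Month = as.integer(names(ozone_max)),
                          OzoneMax = unname(ozone_max))
base_result
```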

Find the average ozone concentration when temperatures are at or above 70 F and when they are below 70 F. Include the number of observations (days) in the two groups.

airquality |>
  dplyr::group_by(Temp >= 70) |>
  dplyr::summarize(OzoneAvg =  mean(Ozone, na.rm = TRUE),
            NumDays = dplyr::n())
## # A tibble: 2 × 3
##   `Temp >= 70` OzoneAvg NumDays
##   <lgl>           <dbl>   <int>
## 1 FALSE            18.0      32
## 2 TRUE             49.1     121

On average, ozone concentration is higher on warm days (Temp >= 70 F). Said another way, mean ozone concentration statistically depends on temperature.

The mean is a model for the data. The statistical dependency of the mean implies that a model for ozone concentration will likely be improved by including temperature as an explanatory variable.
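
When you group by an expression, the grouping column is named after the expression. You can give it a simpler name inside group_by(). A sketch, with Warm as an illustrative name:

```r
by_warmth <- airquality |>
  dplyr::group_by(Warm = Temp >= 70) |>
  dplyr::summarize(OzoneAvg = mean(Ozone, na.rm = TRUE),
                   NumDays = dplyr::n())

by_warmth
```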

To summarize, the important verbs are

Verb                  Description
dplyr::select()       selects columns; pick variables by their names
dplyr::filter()       filters rows; pick observations by their values
dplyr::arrange()      reorders rows
dplyr::mutate()       creates new columns; create new variables with functions of existing variables
dplyr::summarize()    summarizes values; collapse many values down to a single summary
dplyr::group_by()     allows operations to be grouped

The syntax of the verb functions is the same:

Properties

  • The first argument is a data frame. This argument is implicit when using the |> operator.
  • The subsequent arguments describe what to do with the data frame. You refer to columns in the data frame directly (without using $).
  • The result is a new data frame.

These properties make it easy to chain together many simple lines of code to do something complex.
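
The first property can be seen by writing the same operation with and without the pipe. In the sketch below the piped and nested forms give the same data frame. The object names are illustrative.

```r
# Piped form: the data frame is passed along implicitly as the first argument.
piped <- airquality |>
  dplyr::filter(Month == 5) |>
  dplyr::select(Ozone, Temp)

# Nested form: the data frame is written explicitly as the first argument.
nested <- dplyr::select(dplyr::filter(airquality, Month == 5), Ozone, Temp)

identical(piped, nested)
```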

The five functions form the basis of a grammar for data. At the most basic level, you can only alter a data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()), or collapse many values to a summary (summarise()).

Your turn

Consider again the Florida precipitation data set (http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt). Import the data as a data frame, select the columns Apr and Year, group by whether the year is after 1960 (Year > 1960), then compute the mean and variance of the April rainfall with the summarize() function.

Tuesday, September 12, 2022

Today

  • Examples of data munging with functions from the {dplyr} package

You work with data frames. The functions are verbs. The verbs include:

Verb                  Description
dplyr::select()       selects columns; pick variables by their names
dplyr::filter()       filters rows; pick observations by their values
dplyr::arrange()      reorders rows
dplyr::mutate()       creates new columns; create new variables with functions of existing variables
dplyr::summarize()    summarizes values; collapse many values down to a single summary
dplyr::group_by()     allows operations to be grouped

The syntax of the verb functions is the same:

Properties

  • The first argument is a data frame. This argument is implied when using the |> (pipe) operator (also %>%).
  • The subsequent arguments describe what to do with the data frame. You refer to columns in the data frame directly (without using $).
  • The result is a new data frame.

The properties make it easy to chain together simple lines of code to do something complex.

The five functions form the basis of a grammar for data. At the most basic level, you can alter a data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()), or collapse many values to a summary (summarise()).

As a review, consider again the Florida precipitation data set (http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt). Import the data as a data frame, select the columns Apr and Year, group by whether the year is after 1960 (Year > 1960), then summarize by computing the mean and variance of the April rainfall.

FLp.df <- readr::read_table(file = "http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Jan = col_double(),
##   Feb = col_double(),
##   Mar = col_double(),
##   Apr = col_double(),
##   May = col_double(),
##   Jun = col_double(),
##   Jul = col_double(),
##   Aug = col_double(),
##   Sep = col_double(),
##   Oct = col_double(),
##   Nov = col_double(),
##   Dec = col_double()
## )
FLp.df |>
  dplyr::select(Apr, Year) |>
  dplyr::group_by(Year > 1960) |>
  dplyr::summarize(Avg = mean(Apr),
                   Var = var(Apr))
## # A tibble: 2 × 3
##   `Year > 1960`   Avg   Var
##   <lgl>         <dbl> <dbl>
## 1 FALSE          3.14  2.61
## 2 TRUE           2.66  2.07

Example 1: New York City flight data

Let’s consider the flights data frame from the package {nycflights13}.

library(nycflights13)
dim(flights)
## [1] 336776     19

The data contains all 336,776 flights that departed NYC in 2013 and comes from the U.S. Bureau of Transportation Statistics. More information is available by typing ?nycflights13.

The object flights is a tibble (tabled data frame). When you have a large data frame it is useful to make it a tibble because only the first few rows, and only the columns that fit on the screen, are printed.

head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The function filter() selects a set of rows in a data frame. How would you select all flights occurring on February 1st?

flights |>
  dplyr::filter(month == 2 & 
                day == 1)
## # A tibble: 926 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     2     1      456            500        -4      652            648
##  2  2013     2     1      520            525        -5      816            820
##  3  2013     2     1      527            530        -3      837            829
##  4  2013     2     1      532            540        -8     1007           1017
##  5  2013     2     1      540            540         0      859            850
##  6  2013     2     1      552            600        -8      714            715
##  7  2013     2     1      552            600        -8      919            910
##  8  2013     2     1      552            600        -8      655            709
##  9  2013     2     1      553            600        -7      833            815
## 10  2013     2     1      553            600        -7      821            825
## # … with 916 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The function arrange() reorders the rows. If you provide more than one column name as arguments, each additional column is used to break ties in the values of the preceding columns.

How would you arrange all flights in descending order of departure delay?

flights |>
  dplyr::arrange(desc(dep_delay))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Often you work with large data sets with many columns but only a few are of interest. The function select() lets you zoom in on an interesting subset of the columns.

How would you create a data frame containing only the dates, carrier, and flight numbers?

df <- flights |>
  dplyr::select(year:day, carrier, flight)
df
## # A tibble: 336,776 × 5
##     year month   day carrier flight
##    <int> <int> <int> <chr>    <int>
##  1  2013     1     1 UA        1545
##  2  2013     1     1 UA        1714
##  3  2013     1     1 AA        1141
##  4  2013     1     1 B6         725
##  5  2013     1     1 DL         461
##  6  2013     1     1 UA        1696
##  7  2013     1     1 B6         507
##  8  2013     1     1 EV        5708
##  9  2013     1     1 B6          79
## 10  2013     1     1 AA         301
## # … with 336,766 more rows

Note here the sequence operator : to get all columns between the column labeled year and the column labeled day.

How many distinct carriers are there?

df |>
  dplyr::distinct(carrier) |>
  nrow()
## [1] 16

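You include new columns with the function mutate(). Compute the time gained during flight by subtracting the departure delay (minutes) from the arrival delay. Note that with this definition a positive value means the arrival delay is larger than the departure delay, so the flight lost time in the air.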

flights |>
  dplyr::mutate(gain = arr_delay - dep_delay) |>
  dplyr::select(year:day, carrier, flight, gain) |>
  dplyr::arrange(desc(gain))
## # A tibble: 336,776 × 6
##     year month   day carrier flight  gain
##    <int> <int> <int> <chr>    <int> <dbl>
##  1  2013    11     1 VX         399   196
##  2  2013     4    18 AA         707   181
##  3  2013     8     8 UA         996   165
##  4  2013     7    10 DL        1465   161
##  5  2013     6    27 MQ        3199   157
##  6  2013     7    22 DL        1619   154
##  7  2013     7     1 DL        2395   153
##  8  2013     7    10 EV        4580   150
##  9  2013     7    22 MQ        2793   150
## 10  2013     4    18 AA        2083   148
## # … with 336,766 more rows

Determine the average departure delay.

flights |>
  dplyr::summarize(avgDelay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   avgDelay
##      <dbl>
## 1     12.6

Note that if there are missing values in a vector the function mean() needs to include the argument na.rm = TRUE otherwise the output will be NA.

y <- c(5, 6, 7, NA)
mean(y)
## [1] NA
mean(y, na.rm = TRUE)
## [1] 6

You use sample_n() and sample_frac() to take a random sample of rows from the data frame. Take a random sample of five rows from the flights data frame.

flights |>
  dplyr::sample_n(5)
## # A tibble: 5 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013    11    14     1547           1550        -3     1733           1745
## 2  2013     9    11     1229           1238        -9     1319           1354
## 3  2013     7    31     1451           1452        -1     1725           1747
## 4  2013     1    10     1145           1145         0     1322           1321
## 5  2013     4    27      941            950        -9     1230           1252
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Take a random sample of 1% of the rows.

flights |>
  dplyr::sample_frac(.01)
## # A tibble: 3,368 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11    27     1854           1900        -6     2132           2131
##  2  2013     1    25      552            600        -8      644            709
##  3  2013     1     9      658            700        -2      834            839
##  4  2013     4    19     1805           1800         5     1914           1919
##  5  2013     4    23     1600           1545        15     1805           1745
##  6  2013     6    14     1708           1715        -7     1820           1829
##  7  2013     7    27     2358           2359        -1      336            344
##  8  2013     9    10     1512           1453        19     1750           1811
##  9  2013     3    30     1901           1905        -4     2039           2114
## 10  2013     3    15     1910           1905         5     2011           2028
## # … with 3,358 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Use the argument replace = TRUE to perform a bootstrap sample. More on this later.

Random samples are important to modern data science.
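
In {dplyr} version 1.0.0 and later, the function slice_sample() covers both kinds of sampling, with n = for a number of rows and prop = for a fraction. Setting a seed with set.seed() makes the random draw repeatable. A sketch using the built-in airquality data frame; the seed value and object names are arbitrary.

```r
set.seed(1)  # arbitrary seed, chosen only so the draw can be repeated
five_rows <- airquality |>
  dplyr::slice_sample(n = 5)

set.seed(1)  # same seed, so the same five rows are drawn again
same_rows <- airquality |>
  dplyr::slice_sample(n = 5)

identical(five_rows, same_rows)

# A bootstrap sample: as many rows as the original, drawn with replacement.
boot <- airquality |>
  dplyr::slice_sample(prop = 1, replace = TRUE)
nrow(boot)
```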

The verbs are powerful when you apply them to groups of observations within a data frame. This is done with the function group_by(). Determine the average arrival delay by airplane (tail number).

flights |>
  dplyr::group_by(tailnum) |>
  dplyr::summarize(delayAvg = mean(arr_delay, na.rm = TRUE)) |>
  dplyr::arrange(desc(delayAvg))
## # A tibble: 4,044 × 2
##    tailnum delayAvg
##    <chr>      <dbl>
##  1 N844MH      320 
##  2 N911DA      294 
##  3 N922EV      276 
##  4 N587NW      264 
##  5 N851NW      219 
##  6 N928DN      201 
##  7 N7715E      188 
##  8 N654UA      185 
##  9 N665MQ      175.
## 10 N427SW      157 
## # … with 4,034 more rows

Determine the number of distinct planes and flights by destination location.

flights |>
  dplyr::group_by(dest) |>
  dplyr::summarize(planes = dplyr::n_distinct(tailnum),
            flights = dplyr::n())
## # A tibble: 105 × 3
##    dest  planes flights
##    <chr>  <int>   <int>
##  1 ABQ      108     254
##  2 ACK       58     265
##  3 ALB      172     439
##  4 ANC        6       8
##  5 ATL     1180   17215
##  6 AUS      993    2439
##  7 AVL      159     275
##  8 BDL      186     443
##  9 BGR       46     375
## 10 BHM       45     297
## # … with 95 more rows

Repeat but arrange from most to fewest planes.
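
One possible way to do this, sketched under the assumption that the {nycflights13} package is installed, is to add arrange() with desc() to the end of the chain. The object name by_dest is illustrative.

```r
library(nycflights13)

by_dest <- flights |>
  dplyr::group_by(dest) |>
  dplyr::summarize(planes = dplyr::n_distinct(tailnum),
                   flights = dplyr::n()) |>
  dplyr::arrange(dplyr::desc(planes))   # destinations with the most planes first

by_dest
```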

Example 2: Daily weather data from Tallahassee

Let’s consider another set of data: daily high and low temperatures and precipitation in Tallahassee.

The file (TLH_SOD1892.csv) is available in this project in the folder data.

Import the data as a data frame.

TLH.df <- readr::read_csv(file = "data/TLH_SOD1892.csv")
## Rows: 47056 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): STATION, NAME
## dbl  (10): LATITUDE, LONGITUDE, ELEVATION, PRCP, TAVG, TMAX, TMIN, WDF1, WSF...
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data frame contains daily high (TMAX) and low (TMIN) temperatures and total precipitation (PRCP) from two stations: Airport with STATION identification USW00093805 and downtown with STATION identification USC00088754.

Use the select() function to create a new data frame with only STATION, DATE, PRCP, TMAX and TMIN.

TLH.df <- TLH.df |>
  dplyr::select(STATION, DATE, PRCP, TMAX, TMIN)
TLH.df
## # A tibble: 47,056 × 5
##    STATION     DATE        PRCP  TMAX  TMIN
##    <chr>       <date>     <dbl> <dbl> <dbl>
##  1 USW00093805 1940-03-01  0       72    56
##  2 USW00093805 1940-03-02  0       77    53
##  3 USW00093805 1940-03-03  0.05    73    56
##  4 USW00093805 1940-03-04  0       72    44
##  5 USW00093805 1940-03-05  0       61    45
##  6 USW00093805 1940-03-06  0       66    40
##  7 USW00093805 1940-03-07  0       72    36
##  8 USW00093805 1940-03-08  0       56    41
##  9 USW00093805 1940-03-09  0       60    33
## 10 USW00093805 1940-03-10  0       72    32
## # … with 47,046 more rows

Note that you’ve recycled the name of the data frame. You started with TLH.df containing all the columns and you ended with TLH.df containing only the columns selected.

Then use the filter() function to keep only days at or above 90F. Again you recycle the name of the data frame. Use the glimpse() function to take a look at the resulting data frame.

TLH.df <- TLH.df |>
  dplyr::filter(TMAX >= 90) |>
  dplyr::glimpse()
## Rows: 10,632
## Columns: 5
## $ STATION <chr> "USW00093805", "USW00093805", "USW00093805", "USW00093805", "U…
## $ DATE    <date> 1940-05-18, 1940-05-20, 1940-05-21, 1940-05-22, 1940-05-23, 1…
## $ PRCP    <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.…
## $ TMAX    <dbl> 91, 92, 94, 93, 93, 90, 90, 91, 91, 91, 92, 95, 95, 95, 93, 91…
## $ TMIN    <dbl> 53, 60, 67, 64, 71, 60, 58, 62, 68, 73, 71, 72, 72, 70, 72, 70…

Note that the DATE column is a vector of dates having class Date. If it were a vector of character strings, you would convert it to dates with the as.Date() function.
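
A minimal sketch of that conversion with made-up character strings (they are not read from the Tallahassee file). Dates written as year-month-day are recognized directly. Other layouts need a format argument.

```r
# Year-month-day strings are recognized without a format argument.
d1 <- as.Date("1940-05-18")

# Other layouts need an explicit format (%m = month, %d = day, %Y = 4-digit year).
d2 <- as.Date("05/18/1940", format = "%m/%d/%Y")

class(d1)
d1 == d2
```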

Functions from the {lubridate} package are used to extract information from dates. Here you add columns labeled Year, Month, and Day using the extractor functions year(), month(), and day(). You also add a column labeled DoW (day of the week) using the base R function weekdays().

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
TLH.df <- TLH.df |>
  dplyr::mutate(Year = year(DATE),
                Month = month(DATE),
                Day = day(DATE),
                DoW = weekdays(DATE))
TLH.df
## # A tibble: 10,632 × 9
##    STATION     DATE        PRCP  TMAX  TMIN  Year Month   Day DoW      
##    <chr>       <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>    
##  1 USW00093805 1940-05-18  0       91    53  1940     5    18 Saturday 
##  2 USW00093805 1940-05-20  0       92    60  1940     5    20 Monday   
##  3 USW00093805 1940-05-21  0       94    67  1940     5    21 Tuesday  
##  4 USW00093805 1940-05-22  0       93    64  1940     5    22 Wednesday
##  5 USW00093805 1940-05-23  0       93    71  1940     5    23 Thursday 
##  6 USW00093805 1940-05-27  0       90    60  1940     5    27 Monday   
##  7 USW00093805 1940-05-28  0       90    58  1940     5    28 Tuesday  
##  8 USW00093805 1940-06-02  0       91    62  1940     6     2 Sunday   
##  9 USW00093805 1940-06-14  0.45    91    68  1940     6    14 Friday   
## 10 USW00093805 1940-06-17  0       91    73  1940     6    17 Monday   
## # … with 10,622 more rows

Next you keep only the temperature record from the airport. You use the filter() function on the column labeled STATION.

TLH.df <- TLH.df |>
  dplyr::filter(STATION == "USW00093805")

Now what if you want to know how many hot days (90F or higher) occurred each year? You use the group_by() function and count using the n() function.

TLH90.df <- TLH.df |>
  dplyr::group_by(Year) |>
  dplyr::summarize(nHotDays = dplyr::n())

TLH90.df
## # A tibble: 79 × 2
##     Year nHotDays
##    <dbl>    <int>
##  1  1940       63
##  2  1941       96
##  3  1942       75
##  4  1943      101
##  5  1944       95
##  6  1945       83
##  7  1946       71
##  8  1947       94
##  9  1948       97
## 10  1949       70
## # … with 69 more rows

Note that the group_by() function results in a data frame with the first column the variable used inside the function. In this case it is Year. The next columns are defined by what is in the summarize() function.
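
Counting rows by group is common enough that {dplyr} has a shortcut, count(), that combines group_by() and summarize() with n(). The sketch below uses the built-in airquality data frame in place of the Tallahassee data so that it runs without the file. The object names are illustrative.

```r
# Long form: group, then count the rows in each group.
long_form <- airquality |>
  dplyr::filter(Temp >= 90) |>
  dplyr::group_by(Month) |>
  dplyr::summarize(nHotDays = dplyr::n())

# Short form: count() does the grouping and counting in one step.
short_form <- airquality |>
  dplyr::filter(Temp >= 90) |>
  dplyr::count(Month, name = "nHotDays")

all(long_form$nHotDays == short_form$nHotDays)
```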

Repeat but this time group by Month.

TLH.df |>
  dplyr::group_by(Month) |>
  dplyr::summarize(nHotDays = dplyr::n())
## # A tibble: 8 × 2
##   Month nHotDays
##   <dbl>    <int>
## 1     3        2
## 2     4      102
## 3     5      778
## 4     6     1523
## 5     7     1794
## 6     8     1746
## 7     9     1119
## 8    10      157

As expected the number of 90F+ days is highest in July and August. Note that you’ve had 90F+ days in October.

Would you expect there to be more hot days on the weekend? How would you check this?

TLH.df |>
  dplyr::group_by(Year, DoW) |>
  dplyr::summarize(nHotDays = dplyr::n())
## `summarise()` has grouped output by 'Year'. You can override using the `.groups`
## argument.
## # A tibble: 553 × 3
## # Groups:   Year [79]
##     Year DoW       nHotDays
##    <dbl> <chr>        <int>
##  1  1940 Friday          10
##  2  1940 Monday          10
##  3  1940 Saturday         7
##  4  1940 Sunday           8
##  5  1940 Thursday         9
##  6  1940 Tuesday         11
##  7  1940 Wednesday        8
##  8  1941 Friday          17
##  9  1941 Monday          12
## 10  1941 Saturday        13
## # … with 543 more rows

You can group by more than one variable. The code above groups by Year and by DoW, so you get a count of hot days for each day of the week in each year.
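
When you group by more than one variable, summarize() removes only the last grouping variable and prints a message about the grouping that remains. Setting the argument .groups = "drop" returns an ungrouped data frame and quiets the message. A sketch with the built-in airquality data frame; the column names Hot and nDays are illustrative.

```r
hot_counts <- airquality |>
  dplyr::group_by(Month, Hot = Temp >= 90) |>
  dplyr::summarize(nDays = dplyr::n(),
                   .groups = "drop")   # return a data frame with no grouping

dplyr::is_grouped_df(hot_counts)
```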

Recall that you can also arrange() the data frame ordered according to the values in a particular column.

TLH90.df |>
  dplyr::arrange(desc(nHotDays))
## # A tibble: 79 × 2
##     Year nHotDays
##    <dbl>    <int>
##  1  2016      134
##  2  1990      129
##  3  2011      125
##  4  1993      119
##  5  2010      118
##  6  2015      118
##  7  2018      118
##  8  1986      116
##  9  2007      116
## 10  2000      115
## # … with 69 more rows

Putting everything together

Let’s put together your first piece of original research. You know how to import a data file, you know how to manipulate the data frame to compute something of interest, and you know how to make a graph.

Let’s do this for the number of hot days. Let’s say you want a plot of the annual number of hot days in Tallahassee since 1950. Let’s define a hot day as one where the high temperature is at least 90F.

library(ggplot2)

readr::read_csv(file = "data/TLH_SOD1892.csv") |>
  dplyr::filter(STATION == "USW00093805",
                TMAX >= 90) |>
  dplyr::mutate(Year = lubridate::year(DATE)) |>
  dplyr::filter(Year >= 1950) |>
  dplyr::group_by(Year) |>
  dplyr::summarize(nHotDays = dplyr::n()) |>
ggplot(aes(x = Year, y = nHotDays)) +
  geom_point() +
  geom_smooth() +
  scale_y_continuous(limits = c(0, NA)) +
  ylab("Number of Days") +
  ggtitle("Number of Hot Days in Tallahassee Since 1950",
          subtitle = "High Temperature >= 90F") +
  theme_minimal()
## Rows: 47056 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): STATION, NAME
## dbl  (10): LATITUDE, LONGITUDE, ELEVATION, PRCP, TAVG, TMAX, TMIN, WDF1, WSF...
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You go from data in a file to a plot of interest with a set of functions that are logically ordered and easy to read.

What would you change to make a similar plot for the number of hot nights (say, nights where the minimum temperature fails to drop below 74F)?

readr::read_csv(file = "data/TLH_SOD1892.csv") |>
  dplyr::filter(STATION == "USW00093805",
                TMIN >= 74) |>
  dplyr::mutate(Year = lubridate::year(DATE)) |>
  dplyr::filter(Year >= 1950) |>
  dplyr::group_by(Year) |>
  dplyr::summarize(nHotNights = dplyr::n()) |>
ggplot(aes(x = Year, y = nHotNights)) +
  geom_point() +
  geom_smooth() +
  scale_y_continuous(limits = c(0, NA)) +
  ylab("Number of Nights") +
  ggtitle("Number of Hot Nights in Tallahassee Since 1950",
          subtitle = "Low Temperature >= 74F") +
  theme_minimal()
## Rows: 47056 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): STATION, NAME
## dbl  (10): LATITUDE, LONGITUDE, ELEVATION, PRCP, TAVG, TMAX, TMIN, WDF1, WSF...
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Make a similar plot showing the total precipitation by year.

readr::read_csv(file = "data/TLH_SOD1892.csv") |>
  dplyr::filter(STATION == "USW00093805") |>
  dplyr::mutate(Year = lubridate::year(DATE)) |>
  dplyr::filter(Year >= 1950) |>
  dplyr::group_by(Year) |>
  dplyr::summarize(TotalPrecip = sum(PRCP)) |>
ggplot(aes(x = Year, y = TotalPrecip)) +
  geom_point() +
  geom_smooth() +
  scale_y_continuous(limits = c(0, NA)) +
  ylab("Total Precipitation by Year") +
  theme_minimal()
## Rows: 47056 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): STATION, NAME
## dbl  (10): LATITUDE, LONGITUDE, ELEVATION, PRCP, TAVG, TMAX, TMIN, WDF1, WSF...
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
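
The two warnings at the end of the output are consistent with one year having a TotalPrecip of NA. By default sum() returns NA when any value is missing. A minimal sketch with made-up numbers (the vector prcp is hypothetical) shows this behavior and one possible remedy, the na.rm = TRUE argument.

```r
# Hypothetical daily precipitation values with one missing day
prcp <- c(0.10, NA, 0.25, 0.00)

totalDefault <- sum(prcp)                # NA: the missing value propagates
totalNoNA    <- sum(prcp, na.rm = TRUE)  # missing value dropped before summing

totalDefault
totalNoNA
```

Whether dropping the missing days is appropriate depends on how many days are missing in that year.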

Example 3: Food consumption and CO2 emissions

Source: https://www.nu3.de/blogs/nutrition/food-carbon-footprint-index-2018

fc.df <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv')
## Rows: 1430 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, food_category
## dbl (2): consumption, co2_emmission
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(fc.df)
## # A tibble: 6 × 4
##   country   food_category consumption co2_emmission
##   <chr>     <chr>               <dbl>         <dbl>
## 1 Argentina Pork                10.5          37.2 
## 2 Argentina Poultry             38.7          41.5 
## 3 Argentina Beef                55.5        1712   
## 4 Argentina Lamb & Goat          1.56         54.6 
## 5 Argentina Fish                 4.36          6.96
## 6 Argentina Eggs                11.4          10.5

Consumption is kg/person/year and CO2 emission is kg CO2/person/year.

  1. How many different countries are in the data frame?
fc.df |>
  dplyr::distinct(country) |>
  nrow()
## [1] 130
  2. Arrange the countries from most pork consumption per person to the least pork consumption.
fc.df |>
  dplyr::filter(food_category == "Pork") |>
  dplyr::select(country, consumption) |>
  dplyr::arrange(desc(consumption))
## # A tibble: 130 × 2
##    country              consumption
##    <chr>                      <dbl>
##  1 Hong Kong SAR. China        67.1
##  2 Austria                     52.6
##  3 Germany                     51.8
##  4 Spain                       48.9
##  5 Poland                      46.2
##  6 Lithuania                   45.7
##  7 Luxembourg                  43.6
##  8 Croatia                     42.8
##  9 Czech Republic              41.2
## 10 Belarus                     40.4
## # … with 120 more rows
  3. Arrange the countries from the largest carbon footprint with respect to eating habits to the smallest carbon footprint.
fc.df |>
  dplyr::rename(co2_emission = co2_emmission) |>
  dplyr::group_by(country) |>
  dplyr::summarize(totalEmission = sum(co2_emission)) |>
  dplyr::arrange(desc(totalEmission))
## # A tibble: 130 × 2
##    country     totalEmission
##    <chr>               <dbl>
##  1 Argentina           2172.
##  2 Australia           1939.
##  3 Albania             1778.
##  4 New Zealand         1751.
##  5 Iceland             1731.
##  6 USA                 1719.
##  7 Uruguay             1635.
##  8 Brazil              1617.
##  9 Luxembourg          1598.
## 10 Kazakhstan          1575.
## # … with 120 more rows

Summary

Data munging is a big part of data science. Data science is an iterative cycle:

  1. Generate questions about your data.
  2. Search for answers by transforming, visualizing, and modeling the data.
  3. Use what you learn to refine your questions and/or ask new ones.

You use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide what to do.

For additional practice please check out http://r4ds.had.co.nz/index.html.

Cheat sheets http://rstudio.com/cheatsheets

Thursday, September 14, 2022

Today

  • Making graphs

Data visualization is a cornerstone of data science. It gives insights into your data that are not accessible by looking at a spreadsheet or data frame of values.

The {ggplot2} package provides functions to make plots efficiently. The functions implement the grammar of graphics, a theory of data visualization developed by Leland Wilkinson.

At a basic level, graphics/plots/charts (all interchangeable terms) provide a way to explore the patterns in data: the presence of extreme values, distributions of individual variables, and relationships between groups of variables.

Graphics should emphasize the findings and insights you want your audience to understand. This requires a balance.

On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms the audience.

The grammar of graphics specifies how a plot translates data to attributes and geometric objects.

  • Attributes are things like location along an axis, color, shape, and size.
  • Geometric objects are things like points, lines, bars, and polygons.

The type of plot depends on the geometric object, which is specified as a function.

Function names for geometric objects begin with geom_. For example, to create a scatter plot of points the geom_point() function is used.

Make the functions from the {ggplot2} package available in your current session.

library(ggplot2)

Bar chart

A simple graph is the bar chart showing the number of cases within each group. Consider again the annual hurricane counts.

Import the data from the file on my website and print the first six rows.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/US.txt"
LH.df <- readr::read_table(loc)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   All = col_double(),
##   MUS = col_double(),
##   G = col_double(),
##   FL = col_double(),
##   E = col_double()
## )
dplyr::glimpse(LH.df)
## Rows: 166
## Columns: 6
## $ Year <dbl> 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861,…
## $ All  <dbl> 1, 3, 0, 2, 1, 2, 1, 1, 1, 3, 2, 0, 0, 0, 2, 1, 1, 0, 4, 2, 3, 0,…
## $ MUS  <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ G    <dbl> 0, 1, 0, 1, 1, 1, 0, 0, 1, 3, 0, 0, 0, 0, 1, 1, 1, 0, 2, 1, 0, 0,…
## $ FL   <dbl> 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0,…
## $ E    <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0,…

Recall that each case is a year and that the function table() returns the number of years for each landfall count.

table(LH.df$All)
## 
##  0  1  2  3  4  5  6  7 
## 36 50 40 27  6  1  5  1

The number of cases for each count is tallied and displayed below the count. There were 36 years with 0 hurricanes.

The function geom_bar() creates a bar chart of this frequency table.

ggplot(data = LH.df) + 
  geom_bar(mapping = aes(x = All))

You begin a plot with the function ggplot() that creates a coordinate system that you add layers to. The first argument of ggplot() is the data frame to use in the graph. So ggplot(data = LH.df) creates an empty graph.

You complete the graph by adding one or more layers. The function geom_bar() adds a layer of bars to your plot, which creates a bar chart.

Each geom_ function takes a mapping argument. This defines how variables in your data frame are mapped to visual properties. The mapping argument is always paired with the aes() function, and the x argument of aes() specifies which variable to map to the x-axis, in this case All. ggplot() looks for the mapped variable in the data argument, in this case, LH.df.

The function geom_bar() tabulates the counts and then maps the number of cases to bars with the bar height proportional to the number of cases. Here the number of cases is the number of years with that many hurricanes.

The functions are applied in order (ggplot() comes before geom_bar()) and are linked with the addition + symbol. In this way you can think of the functions as layers in a GIS.

The bar chart contains the same information as displayed by the function table(). The y-axis label is ‘count’ and x-axis label is the column name.
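
One way to check that the bar heights match the table is to compare the counts computed by geom_bar() with the output of table(). The sketch below uses a short made-up vector of counts (toy.df is hypothetical, not the hurricane data) and the {ggplot2} function layer_data(), which returns the data frame that a layer draws.

```r
library(ggplot2)

# Hypothetical annual counts standing in for LH.df$All
toy.df <- data.frame(All = c(0, 1, 1, 2, 0, 1, 3, 2, 1, 0))

p <- ggplot(data = toy.df) +
  geom_bar(mapping = aes(x = All))

# Counts computed by the bar layer, ordered by x value
bars <- layer_data(p)
barHeights <- bars$count[order(bars$x)]

# Counts from table(), which are also ordered by value
tabCounts <- as.vector(table(toy.df$All))

barHeights
tabCounts
```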

Repeat, this time using Florida hurricane counts. The annual number of Florida hurricanes is given in column FL in the data frame LH.df.

LH.df$FL
##   [1] 1 2 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 0 2 1 0 1 2 1 0 3 0 2 0 0 0 3 1
##  [38] 2 0 0 0 0 1 2 0 3 1 1 1 0 1 0 1 0 0 2 0 0 1 1 1 0 0 0 1 2 1 0 1 0 1 0 0 2
##  [75] 1 2 0 2 1 0 0 0 2 2 2 1 0 0 1 0 1 2 0 1 2 1 2 2 1 2 0 0 1 0 0 1 0 0 0 1 0
## [112] 0 0 3 1 2 1 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 2 0 1 0 1 0 0 1 0 0 2 0 0 2
## [149] 1 0 0 0 0 4 3 0 0 0 0 0 0 0 0 0 0 1

The geom_bar() function tabulates these numbers and plots the frequency as a bar.

ggplot(data = LH.df) + 
  geom_bar(mapping = aes(x = FL)) +
  xlab("Number of Florida Hurricanes (1851-2016)") +
  ylab("Number of Years")

Here axis labels are placed on the plot with the functions ylab() and xlab(). With this type of ‘layering’ it’s easy to go from data on the web to a publishable plot.

Pie preference

Thirty graduate students are surveyed about their favorite pie. Categories are (1) chocolate cream, (2) coconut custard, (3) georgia peach, and (4) strawberry. To make a bar chart first create the data as a character vector and then change the vector to a data frame.

pie <- c(rep('chocolate cream', times = 4), 
         rep('coconut custard', times =  12), 
         rep('georgia peach', times =  5), 
         rep('strawberry', times =  9))
piePref.df <- as.data.frame(pie)

Use the function str() to see the column type in the data frame.

str(piePref.df)
## 'data.frame':    30 obs. of  1 variable:
##  $ pie: chr  "chocolate cream" "chocolate cream" "chocolate cream" "chocolate cream" ...

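There is a single column in the data frame with the name pie. It is a character (chr) vector with four distinct values, one for each type of pie. For plotting it is often useful to convert such a column to a factor. A factor is a categorical vector. It looks like a character vector but its values are restricted to a set of levels that can be put in a specific order. This is important when factors are used in statistical models.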
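
The difference between a character vector and a factor can be seen with a few lines of base R. This sketch uses a short made-up vector (flavor is a hypothetical name) to show the default alphabetical levels and how to set the levels explicitly.

```r
# A short character vector (hypothetical survey answers)
flavor <- c("strawberry", "chocolate cream", "strawberry", "georgia peach")

is.character(flavor)   # TRUE
is.factor(flavor)      # FALSE

# Converting to a factor: levels default to alphabetical order
flavorF <- factor(flavor)
levels(flavorF)

# Setting the levels explicitly fixes their order
flavorF2 <- factor(flavor,
                   levels = c("strawberry", "georgia peach", "chocolate cream"))
levels(flavorF2)
```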

Create a table.

table(piePref.df$pie)
## 
## chocolate cream coconut custard   georgia peach      strawberry 
##               4              12               5               9

Create a bar chart and specify the axis labels.

ggplot(data = piePref.df) + 
  geom_bar(mapping = aes(x = pie)) +
  xlab("Pie Preference") + 
  ylab("Number of Students")

This is a good start. Improvements should be made.

First, the bar order is alphabetical from left to right. This is the default ordering for character vectors or for factor variables created from character vectors. It is much easier to make comparisons if frequency determines the order.

To change the order on the bar chart specify the order of the factor levels on the vector pie.

pie <- factor(pie, 
               levels = c("coconut custard", "strawberry", "georgia peach", "chocolate cream"))
piePref.df <- as.data.frame(pie)

Now remake the bar chart.

ggplot(data = piePref.df) + 
  geom_bar(mapping = aes(pie)) +
  xlab("Pie Preference") + 
  ylab("Number of Students")

Second, the vertical axis tick labels are fractions. Since the bar heights are counts (integers) the tick labels also should be integers.

To override this default you add a new y-axis layer. The layer is the function scale_y_continuous() where you indicate the lower and upper limits of the axis with the limits = argument and the concatenate function c(). Now remake the bar chart.

ggplot(data = piePref.df) + 
  geom_bar(mapping = aes(pie)) +
  xlab("Pie Preference") + 
  ylab("Number of Students") +
  scale_y_continuous(limits = c(0, 15))
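
If the tick labels still fall at awkward values, one option (not used in the code above) is to set the tick locations explicitly with the breaks = argument of scale_y_continuous(). A minimal sketch, re-creating the pie vector so the code runs on its own; the object names yBreaks and p are chosen here for illustration.

```r
library(ggplot2)

# Re-create the pie preference data
pie <- c(rep("chocolate cream", times = 4),
         rep("coconut custard", times = 12),
         rep("georgia peach", times = 5),
         rep("strawberry", times = 9))
piePref.df <- data.frame(pie)

# Tick marks every 3 students from 0 to 15
yBreaks <- seq(from = 0, to = 15, by = 3)

p <- ggplot(data = piePref.df) +
  geom_bar(mapping = aes(x = pie)) +
  xlab("Pie Preference") +
  ylab("Number of Students") +
  scale_y_continuous(limits = c(0, 15), breaks = yBreaks)
p
```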

Now the chart is publishable. Options exist for changing the look of the plot for digital media, including colors, orientation, and background.

For example, to change the bar color use the fill = argument in the function geom_bar(). To change the orientation of the bars use the layer function coord_flip(), and to change the background use the layer function theme_minimal(). You make changes to the look of the plot with additional layers.

ggplot(data = piePref.df) + 
  geom_bar(mapping = aes(x = pie), fill = "blue") +
  xlab("Pie Preference") + 
  ylab("Number of Students") +
  scale_y_continuous(limits = c(0, 15)) +
  coord_flip() +
  theme_minimal()

Recall: the fill = argument applies to the bars defined by the variable named in the aes() function, but because the color is a fixed value it is specified outside the aes() function.

Available colors include

colors()

In the above example you manually reordered the levels in the factor vector pie according to preference. Let’s see how to do this automatically.

Consider storm intensity of tropical cyclones during 2017. First create two vectors one numeric containing the minimum pressures (millibars) and the other character containing the storm names.

minP <- c(990, 1007, 992, 1007, 1005, 981, 967, 938, 914, 938, 972, 971)
name <- c("Arlene", "Bret", "Cindy", "Don", "Emily", "Franklin", "Gert", 
         "Harvey", "Irma", "Jose", "Katia", "Lee")

The function reorder() takes a character vector as the first argument and returns a factor with the order of the levels dictated by the numeric values in the second argument.

reorder(name, minP)
##  [1] Arlene   Bret     Cindy    Don      Emily    Franklin Gert     Harvey  
##  [9] Irma     Jose     Katia    Lee     
## attr(,"scores")
##   Arlene     Bret    Cindy      Don    Emily Franklin     Gert   Harvey 
##      990     1007      992     1007     1005      981      967      938 
##     Irma     Jose    Katia      Lee 
##      914      938      972      971 
## 12 Levels: Irma Harvey Jose Gert Lee Katia Franklin Arlene Cindy Emily ... Don

The vector name is in alphabetical order but the factor levels indicate the order of storms from lowest pressure (Irma) to highest pressure (Don).
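
To look at the level order alone, wrap the call in levels(). The sketch below uses four of the storms listed above. Negating the second argument is one simple way to reverse the order; the object names lowFirst and highFirst are chosen here for illustration.

```r
# Four of the 2017 storms and their minimum pressures (mb) from above
name <- c("Arlene", "Don", "Harvey", "Irma")
minP <- c(990, 1007, 938, 914)

# Levels ordered from lowest to highest pressure
lowFirst <- levels(reorder(name, minP))
lowFirst

# Negating the values reverses the order (highest pressure first)
highFirst <- levels(reorder(name, -minP))
highFirst
```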

Using the mutate() function you can add a column to a data frame where the column is an ordered factor.

Note that it is the difference in pressure (deltaP for short) between the air outside the tropical cyclone and the air in the center that causes the winds. Cyclones with a large pressure difference are stronger in terms of their wind speed.

Typically the air outside is about 1014 mb so you compute deltaP and then reorder the tropical cyclone names using this computed variable.

df <- data.frame(name, minP) |>
  dplyr::mutate(deltaP = 1014 - minP,
                nameOrderedFactor = reorder(name, deltaP))

Finally you plot the bar chart. Since there is no tabulation of the values you use geom_col() instead of geom_bar().

ggplot(data = df) + 
  geom_col(mapping = aes(x = nameOrderedFactor, y = deltaP)) +
  ylab("Pressure Difference [mb]") +
  xlab("Atlantic Tropical Cyclones of 2017") +
  coord_flip()

Note: geom_bar() plots a bar chart AFTER tabulating a column. geom_col() plots a bar chart on a pre-tabulated column.
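
The distinction can be checked on a small example. The sketch below builds the same bar chart two ways from made-up data (raw.df and tab.df are hypothetical names) and compares the bar heights that each layer computes, using layer_data() from {ggplot2}.

```r
library(ggplot2)

# Raw (untabulated) data: one row per student, hypothetical values
raw.df <- data.frame(pie = c("strawberry", "strawberry", "georgia peach",
                             "coconut custard", "coconut custard",
                             "coconut custard"))

# Pre-tabulated data: one row per pie type with a count column n
tab.df <- raw.df |>
  dplyr::count(pie)

pBar <- ggplot(data = raw.df) +
  geom_bar(mapping = aes(x = pie))          # tabulates, then draws

pCol <- ggplot(data = tab.df) +
  geom_col(mapping = aes(x = pie, y = n))   # draws the counts as given

# Compare the bar heights computed by the two layers
barData <- layer_data(pBar)
colData <- layer_data(pCol)
heightsBar <- barData$count[order(barData$x)]
heightsCol <- colData$y[order(colData$x)]
```

Both vectors hold the counts for the pie types in alphabetical order, so the two plots show the same bars.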

Let’s return to the weather data from Tallahassee.

df <- readr::read_csv(file = "data/TLH_SOD1892.csv") |>
  dplyr::filter(STATION == "USW00093805") |>
  dplyr::mutate(Year = lubridate::year(DATE),
         Month = lubridate::month(DATE)) |>
  dplyr::filter(Year >= 1980 & Month == 9) |>
  dplyr::group_by(Year) |>
  dplyr::summarize(TotalPrecip = sum(PRCP)) |>
  dplyr::mutate(Year = reorder(as.factor(Year), TotalPrecip))
## Rows: 47056 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): STATION, NAME
## dbl  (10): LATITUDE, LONGITUDE, ELEVATION, PRCP, TAVG, TMAX, TMIN, WDF1, WSF...
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(data = df) + 
  geom_col(mapping = aes(x = Year, y = TotalPrecip)) +
  ylab("September Rainfall [in]") +
  coord_flip()

Histogram

The histogram is similar to the bar chart except it uses bars to indicate frequency (or proportion) over an interval of continuous values. For instance, with continuous values the function table() is not useful.

x <- rnorm(n = 10)
table(x)
## x
##    -1.6032276455739   -1.21105005355005  -0.930597500697453  -0.659731635004227 
##                   1                   1                   1                   1 
##  -0.638249710440162  -0.499239870677882 -0.0277773938058075   0.277103773302114 
##                   1                   1                   1                   1 
##   0.526728668221472   0.545780025214686 
##                   1                   1

So neither is a bar plot.

A histogram is made as follows: First a collection of disjoint intervals, called bins, covering the range of data points is chosen. “Disjoint” means no overlap, so the intervals look like (a,b] or [a,b). The interval (a,b] means the interval contains all the values from a to b including b but not a, whereas the interval [a,b) means the interval contains all the values from a to b including a but not b.

Second, the number of data values in each of these intervals is counted. Third, a bar is drawn above the interval so that the area of the bar is proportional to the frequency. If the intervals defining the bins have the same width, then the height of the bar is proportional to the frequency (the number of values inside the interval).
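
The first two steps can be sketched in base R with the functions cut() and table(). The values in x are made up for illustration. The argument right = controls whether the intervals are of the form (a,b] or [a,b).

```r
# Hypothetical continuous values
x <- c(0.5, 1.9, 2.0, 2.1, 3.7, 4.0, 5.5, 9.9)

# Five disjoint bins of equal width covering the range 0 to 10
breaks <- seq(from = 0, to = 10, by = 2)

# Intervals of the form (a, b]: the value 2.0 falls in (0, 2]
countsRight <- table(cut(x, breaks = breaks, right = TRUE))

# Intervals of the form [a, b): the value 2.0 falls in [2, 4)
countsLeft <- table(cut(x, breaks = breaks, right = FALSE))

countsRight
countsLeft
```

Values that sit exactly on a bin boundary (here 2.0 and 4.0) move to a different bin when the interval type changes, so the two sets of counts differ.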

Let’s return to the Florida precipitation data.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt"
FLp.df <- readr::read_table(loc)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Jan = col_double(),
##   Feb = col_double(),
##   Mar = col_double(),
##   Apr = col_double(),
##   May = col_double(),
##   Jun = col_double(),
##   Jul = col_double(),
##   Aug = col_double(),
##   Sep = col_double(),
##   Oct = col_double(),
##   Nov = col_double(),
##   Dec = col_double()
## )

Recall that the columns in the data frame FLp.df are months (variables) and rows are years. As the column specification shows, Year and the months are all read as numeric (double) vectors. Create a histogram of May precipitation.

ggplot(data = FLp.df) + 
  geom_histogram(mapping = aes(x = May), col = "white")  +
  xlab("May Precipitation in Florida (in)") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

By default the function geom_histogram() picks 30 bins. Since there are only 118 May values many of the bins have fewer than 5 values.

When making a histogram you need to vary the number of bins before deciding on a final plot. This can be done with the bins = or binwidth = argument. For example, the look of the histogram is improved by halving the default number of bins.

ggplot(data = FLp.df) + 
  geom_histogram(mapping = aes(x = May), col = "white", bins = 15)  +
  xlab("May Precipitation in Florida (in)") 

It looks even better by decreasing the number of bins to 11.

ggplot(data = FLp.df) + 
  geom_histogram(mapping = aes(x = May), col = "white", bins = 11, fill = "green3")  +
  xlab("May Precipitation in Florida (in)") +
  ylab("Number of Years")

Here the fill = argument is used to change color and a ylab() layer is added to make the y-axis label more concise.

The geom_rug() layer adds the location of the data values as tick marks just above the horizontal axis. And the col = "white" argument sets the color of the bin boundaries.

ggplot(data = FLp.df) + 
  geom_histogram(mapping = aes(x = May), col = "white", bins = 11, fill = "green3")  +
  xlab("May Precipitation in Florida (in)") +
  ylab("Number of Years") +
  geom_rug(mapping = aes(x = May))

Since both layers use the same mapping you can specify it once in the ggplot() function, where it is inherited by every layer.

ggplot(data = FLp.df, mapping = aes(x = May)) + 
  geom_histogram(col = "black", bins = 11, fill = "pink")  +
  xlab("May Precipitation in Florida (in)") +
  ylab("Number of Years") +
  geom_rug()

Density plot

A density plot is a smoothed histogram with units of probability density on the vertical axis. It’s motivated by the fact that for a continuous variable, the probability that the variable takes on any particular value is 0. Instead you need a range of values over which a probability is defined.

The probability density answers the question, what is the chance that a value falls in a small interval. This chance varies depending on where the value is located within the distribution of all values (e.g., near the middle of the distribution the chance is highest).

ggplot(data = FLp.df) +
  geom_density(mapping = aes(x = May)) +
  xlab("May Precipitation in Florida (in)") 

The height of the curve is the probability density: the chance that rainfall falls within a small interval around a value on the horizontal axis, divided by the width of that interval. The amount of smoothing is determined by the bandwidth (bw =).
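
One consequence of these units is that the area under a density curve is 1. The sketch below checks this with the base R function density() on made-up values (x is hypothetical, not the Florida data). The area is approximated by summing the curve heights multiplied by the grid spacing.

```r
set.seed(1)
x <- rnorm(n = 100, mean = 4, sd = 2)   # hypothetical rainfall-like values

# Kernel density estimate on an equally spaced grid
d <- density(x)

# Approximate the area under the curve with a Riemann sum
dx <- d$x[2] - d$x[1]
area <- sum(d$y) * dx
area
```

The result is close to 1 regardless of the units of x, which is why the heights themselves change when the data units change.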

The values along the vertical axis depend on the data units. They can be tricky to interpret. Instead geom_freqpoly() produces a density-like graph where the units on the y-axis are counts as with the histogram.

ggplot(data = FLp.df, aes(x = May)) + 
  geom_freqpoly(color = "green3", binwidth = 1) +
  xlab("May Precipitation in Florida (in)") +
  geom_rug()

Box plot

The box plot graphs the summary statistics. These statistics include the minimum value, the maximum value, the 1st & 3rd quartile values, and the median value. The easiest way to create a box plot is to use the function boxplot().

boxplot(FLp.df$May)

The function boxplot() is from the base {graphics} package. It is not a {ggplot2} function. Others from this package include hist() for histograms and plot() for scatter plots.

The base graphics lets you manipulate details of a graph. For example:

boxplot(FLp.df$May, 
        ylab = "May Precipitation in FL (in)")
f <- fivenum(FLp.df$May)
text(rep(1.3, 5), f, labels = c("Minimum", "1st Quartile", 
                                "Median", "3rd Quartile",
                                "Maximum"))
text(1.3, 7.792, labels = "Last Value Within\n 1.5xIQR Above 3rd Q")

The box plot illustrates the five numbers graphically. The median is the line through the box. The bottom and top of the box are the 1st and 3rd quartile values. Whiskers extend vertically from the box downward toward the minimum and upward toward the maximum.

If values extend beyond 1.5 times the interquartile range (either above or below the corresponding quartile) the whisker is truncated at the last value within the range and points are used to indicate outliers.
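
The whisker rule can be worked through by hand. The sketch below uses made-up values (x is hypothetical) with one unusually large value, and compares the hand calculation with boxplot.stats(), the base R function that computes the values boxplot() draws.

```r
# Hypothetical values with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

f <- fivenum(x)            # minimum, 1st quartile, median, 3rd quartile, maximum
iqr <- f[4] - f[2]         # interquartile range based on the quartiles above
upperLimit <- f[4] + 1.5 * iqr

# boxplot.stats() returns the values that boxplot() draws
bs <- boxplot.stats(x)
bs$stats   # whisker ends, box edges, and median
bs$out     # points drawn individually as outliers
```

Here the upper limit is 8 + 1.5 × 5 = 15.5, so the upper whisker stops at 9 (the last value within the limit) and the value 30 is drawn as a point.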

To make a box plot using the function ggplot() you need a dummy variable for the x argument in the function aes(). This is done with x = "".

ggplot(FLp.df) + 
  geom_boxplot(mapping = aes(x = "", y = May)) +
  xlab("") + 
  ylab("May Precipitation in Florida (in)")

Side-by-side box plots

Suppose you want to show box plots for each month. In this case you make the x argument in the aes() the name of a column containing the vector of month names.

You first turn the data frame from its native ‘wide’ format to a {ggplot2} friendly ‘long’ format.

Wide format data is called ‘wide’ because it typically has a lot of columns that stretch across your computer screen. Long format data is called ‘long’ because it has fewer columns while preserving all the information. In order to have fewer columns, it has to be longer.

Wide format data are most common. They are convenient for data entry. They let you see more of the data at one time. For example, the FLp.df data frame.

head(FLp.df)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1895 3.28   3.24  2.50 4.53   4.25  4.5   7.45  6.10  4.67  3.09 2.65   1.59
## 2  1896 3.93   3.02  2.57 0.498  2.7  11.2   8.22  5.89  4.35  2.96 3.52   2.07
## 3  1897 1.84   6     2.12 4.39   2.28  5.22  7.21  6.83 11.1   4.10 1.75   2.68
## 4  1898 0.704  2.01  1.26 1.32   1.51  3.29  8.95 13.1   5.23  5.88 2.19   3.89
## 5  1899 4.52   5.92  1.90 3.40   1.11  5.80  9.26  6.71  5.13  5.88 0.751  1.94
## 6  1900 3.21   4.37  6.8  4.32   3.89  9.99  7.50  4.49  4.93  5.23 1.22   4.29

The long data format is less familiar. It corresponds to the relational model for storing data used by most modern databases, such as those queried with SQL.

Use the pivot_longer() function from the {tidyr} package to turn the wide data frame into a long data frame. Let’s do it and then decipher what happens.

library(tidyr)

FLpL.df <- FLp.df |>
  tidyr::pivot_longer(cols = -Year, 
                      names_to = "Month",
                      values_to = "Precipitation")

str(FLpL.df)
## tibble [1,404 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Year         : num [1:1404] 1895 1895 1895 1895 1895 ...
##  $ Month        : chr [1:1404] "Jan" "Feb" "Mar" "Apr" ...
##  $ Precipitation: num [1:1404] 3.28 3.24 2.5 4.53 4.25 ...

Note that the column Month is a character vector. When making a plot using this variable the order will be alphabetical. So instead you change it to a factor vector with levels equal to the month abbreviations.

month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
FLpL.df <- FLpL.df |>
  dplyr::mutate(Month = factor(Month, levels = month.abb))

The pivot_longer() function takes all the columns to pivot into a longer format. Here you choose them all EXCEPT the one named after the - sign (Year). All variables are measured (precipitation in units of inches) except Year.

The resulting long data frame has the Year variable in the first column and the remaining column names as the name variable in the second column.

You change the default name to Month by specifying the names_to = "Month" argument. The third column contains the corresponding precipitation values all in a single column named value.

You change the default name of this column by specifying the values_to = "Precipitation" argument.

Note that you reverse this with the pivot_wider() function.

FLpW.df <- FLpL.df |>
  tidyr::pivot_wider(id_cols = Year,
                     names_from = Month,
                     values_from = Precipitation)
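
The round trip is easier to follow on a data frame small enough to print in full. The sketch below uses a made-up two-year, three-month data frame (wide.df, long.df, and back.df are hypothetical names and the values are illustrative only).

```r
# Small wide data frame with hypothetical precipitation values
wide.df <- data.frame(Year = c(1895, 1896),
                      Jan = c(3.3, 3.9),
                      Feb = c(3.2, 3.0),
                      Mar = c(2.5, 2.6))

# Wide to long: one row per Year-Month combination
long.df <- wide.df |>
  tidyr::pivot_longer(cols = -Year,
                      names_to = "Month",
                      values_to = "Precipitation")
long.df

# Long back to wide
back.df <- long.df |>
  tidyr::pivot_wider(id_cols = Year,
                     names_from = Month,
                     values_from = Precipitation)
back.df
```

The long data frame has 2 × 3 = 6 rows and 3 columns, and pivoting it back recovers the original columns and values.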

To help conceptualize what is going on take a look at this gif.

Then to create the box plot specify that the x-axis be the key variable (here Month) and the y-axis to be the measured variable (here Precipitation).

ggplot(data = FLpL.df) + 
  geom_boxplot(mapping = aes(x = Month, y = Precipitation)) +
  ylab("Precipitation (in)")

This is a climograph.

Each geom_ function is a layer. Data for the layer is specified in the function ggplot() with the data frame argument and the aes() function. To add another layer to the plot with different data you specify the data within the geom function. For example, let’s repeat the climograph of monthly precipitation highlighting the month of May.

You add a geom_boxplot() layer and specify a subset of the data using the subset [] operator when specifying the data = argument.

ggplot(data = FLpL.df, 
       aes(x = Month, y = Precipitation)) + 
  geom_boxplot() +
  ylab("Precipitation (in)") +
  geom_boxplot(data = FLpL.df[FLpL.df$Month == "May", ], 
               aes(x = Month, y = Precipitation), 
               fill = "green")

Cheat sheets: https://ggplot2tor.com/cheatsheets/

Additional help: https://moderndive.com/2-viz.html

Tuesday, September 19, 2022

Today

  • More about making graphs in R

Comparing distributions

Previously you learned how to make a histogram from data. To review, consider again the Florida rainfall data.

Import the data.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt"
FLp.df <- readr::read_table(loc, na = "-9.900")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Jan = col_double(),
##   Feb = col_double(),
##   Mar = col_double(),
##   Apr = col_double(),
##   May = col_double(),
##   Jun = col_double(),
##   Jul = col_double(),
##   Aug = col_double(),
##   Sep = col_double(),
##   Oct = col_double(),
##   Nov = col_double(),
##   Dec = col_double()
## )

Then use the ggplot() and geom_histogram() functions to make a histogram of rainfall during March and add a label on the horizontal axis (x-axis). Here you assign the plot to an object called p1. A list object is created in your environment but nothing is plotted until you type the object name.

library(ggplot2)

p1 <- ggplot(data = FLp.df) +
             geom_histogram(mapping = aes(x = Mar), 
                               bins = 11, 
                               fill = "green3",
                                col = "white") +
             xlab("March Rainfall in Florida (in)") 
p1

The histogram shows the shape of the distribution. The distribution is made up of all 118 years of March rainfall. Most years have rainfall values between 2 and 4 inches. A few years have values that exceed 7.5 inches.

The average, median, standard deviation, minimum, and maximum are obtained as follows:

FLp.df |>
  dplyr::select(Mar) |>
  dplyr::summarize(avg = mean(Mar),
                   med = median(Mar),
                   sd = sd(Mar),
                   min = min(Mar),
                   max = max(Mar))
## # A tibble: 1 × 5
##     avg   med    sd   min   max
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  3.66  3.35  1.95 0.496  8.70

The average value is larger than the median value and the histogram is not symmetric. That is, the number of cases with low rainfall exceeds the number of cases with heavy rainfall.
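A quick way to check this kind of asymmetry is to compare the mean and the median directly. The sketch below uses a small made-up vector (not the Florida data) in which one large value pulls the mean above the median.

```r
# Hypothetical rainfall values (inches); one wet case creates a long right tail
rain <- c(1, 2, 2, 3, 3, 3, 4, 10)

avg <- mean(rain)     # pulled upward by the single large value
med <- median(rain)   # does not depend on how large the largest value is

avg
med
avg > med             # TRUE for this right-skewed sample
```

The same comparison applied to the Mar column should agree with the summary table above, assuming the data import succeeded.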

The histogram helps us to describe the statistical distribution of the values.

To see this, recall that you can generate values from any distribution. For example, you generate values from a normal (Gaussian) distribution with the rnorm() function by specifying the mean and the standard deviation.

Here you do this using the mean and standard deviation from our rainfall values. Since there are 118 March rainfall values (one for each year) you set n = 118.

nd <- rnorm(n = 118, 
            mean = 3.65, 
            sd = 1.95)
nd
##   [1]  6.7416884  0.2139696  2.7433698  3.8824960  0.6194203  4.4354257
##   [7]  5.4050169  0.8092758 -0.3419746  7.5958964 -2.5483778  5.5816917
##  [13]  3.1746390  5.8613094  5.6541227  0.9086666  6.3098643  2.7549266
##  [19]  3.8427015  6.3239940  3.4684659  5.3702174  1.4655571  2.6863474
##  [25]  4.3661373  7.5019611  4.1326324  3.5311833  3.0550273  3.0955286
##  [31]  3.5985192 -1.0867786 -0.4001121  1.8175477  4.3137304  6.5331517
##  [37]  3.2289950  1.1430855  3.4980687  2.4295630  2.7532689  4.8229282
##  [43]  2.7737379  4.3975801 -1.3843887  5.9203663  1.2241661  3.9597285
##  [49]  3.9163964  5.6894578  3.8132714  3.1678973  4.3318985  7.7737934
##  [55]  5.9390685  3.5930825  7.1304144  1.2929749  2.6178412  3.6042137
##  [61]  5.7586311  4.9468082  1.7642494  2.2107400  1.0706937  2.2323835
##  [67]  0.4622967  6.4902349  3.4607394  2.8777164  2.2998399  4.1054467
##  [73]  0.1484932  2.1125788  3.7443694  3.3144648  5.9140059  0.5406493
##  [79]  6.3481199  3.2473482  4.8429054  4.3916607  6.1921813  3.9140055
##  [85]  2.9784233  2.1965664  4.1856974  3.5287084  2.4022348  2.4280930
##  [91]  5.0584030  5.7855713  3.7728832  1.2280599  4.3302056  4.2992629
##  [97]  6.4965545  4.7139889  5.5938008  2.7929664  1.1824149  2.6774331
## [103]  6.9224717  4.0089298  4.6029910  2.7755865  2.7627575  5.0998429
## [109]  2.2951031  1.6149303  5.5045684  6.8846473  3.2636776  6.2280476
## [115]  1.7441248  4.7856600  4.2775693  5.0885180
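Because rnorm() draws random values, the numbers printed above will differ each time the code is run. A minimal sketch of how to make the draw reproducible with set.seed() follows. The seed value 5165 is arbitrary and chosen here only for illustration.

```r
# Fix the random number stream so the same 118 values are drawn each time
set.seed(5165)   # arbitrary seed (an assumption for this example)
nd1 <- rnorm(n = 118, mean = 3.65, sd = 1.95)

set.seed(5165)   # resetting the seed replays the same stream
nd2 <- rnorm(n = 118, mean = 3.65, sd = 1.95)

identical(nd1, nd2)   # same seed, same values

# A normal distribution is unbounded, so some draws can be negative,
# which is impossible for rainfall amounts
sum(nd1 < 0)
```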

Collectively these values look quite a bit like the actual rainfall. Let’s make a histogram from these 118 values and assign it to p2.

df <- data.frame(nd)
p2 <- ggplot(data = df) +
        geom_histogram(mapping = aes(x = nd), 
                       bins = 11, 
                       col = "white") +
        xlab("Gaussian Distribution")
p2

Let’s do the same for a set of values from a uniform distribution and from a gamma distribution.

ud <- runif(n = 118,
            min = .5, 
            max = 8.7)

p3 <- ggplot(data = data.frame(ud)) +
        geom_histogram(mapping = aes(x = ud), 
                       bins = 11, 
                       col = "white") +
        xlab("Uniform Distribution")

gd <- rgamma(n = 118, 
             shape = 3.2,
             rate = .9)

p4 <- ggplot(data = data.frame(gd)) +
        geom_histogram(mapping = aes(x = gd), 
                       bins = 11, 
                       col = "white") +
        xlab("Gamma Distribution")

Now put all four plots on a single graph. You do this with the {patchwork} package.

The package gives operators like + and / different meanings when applied to ggplot objects.

library(patchwork)
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
## 
##     area
(p1 + p2) / (p3 + p4)

What distribution best matches the shape of the March rainfall values?

Box plots

A box plot graphically illustrates summary statistics. The summary statistics include the minimum value, the maximum value, the 1st & 3rd quartile values, and the median value.

A non-ggplot way to create a box plot is to use the function boxplot(). Here you get a box plot of the May rainfall.

boxplot(FLp.df$May)

The function boxplot() is from the base {graphics} package. Others from this package include hist() for histograms and plot() for scatter plots.

The base graphics lets you manipulate details of a graph. For example:

boxplot(FLp.df$May, 
        ylab = "May Rainfall in FL (in)")
f <- fivenum(FLp.df$May)
text(rep(1.3, 5), f, labels = c("Minimum", "1st Quartile", 
                                "Median", "3rd Quartile",
                                "Maximum"))
text(1.3, 7.792, labels = "Last Value Within\n 1.5xIQR Above 3rd Q")

The box plot illustrates the five numbers graphically. The median is the line through the box. The bottom and top of the box are the 1st and 3rd quartile values. Whiskers extend vertically from the box downward toward the minimum and upward toward the maximum.

If values extend beyond 1.5 times the interquartile range (either above or below the corresponding quartile) the whisker is truncated at the last value within the range and points are used to indicate outliers.
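The whisker rule can be worked through by hand. The sketch below uses made-up values with one unusually large observation and compares the hand calculation with boxplot.stats(), the base R function that computes these statistics. Note that fivenum() returns 'hinges', which can differ slightly from quartiles computed with quantile().

```r
# Hypothetical values with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

f <- fivenum(x)                   # minimum, lower hinge, median, upper hinge, maximum
iqr <- f[4] - f[2]                # height of the box
upper_fence <- f[4] + 1.5 * iqr   # limit for the upper whisker
lower_fence <- f[2] - 1.5 * iqr   # limit for the lower whisker

# Whiskers end at the most extreme values that are still inside the fences
whisker_top <- max(x[x <= upper_fence])
whisker_bottom <- min(x[x >= lower_fence])
outliers <- x[x > upper_fence | x < lower_fence]

bs <- boxplot.stats(x)
bs$stats   # whisker bottom, lower hinge, median, upper hinge, whisker top
bs$out     # values drawn as individual points
```

Here the upper whisker stops at 9 rather than at the maximum, and the value 30 is drawn as a point.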

To make the same box plot using functions from the {ggplot2} package you use the geom_boxplot() layer.

ggplot(data = FLp.df) + 
  geom_boxplot(mapping = aes(y = May)) +
  xlab("") + 
  ylab("May Rainfall in Florida (in)")

Long data frames

Suppose you want to make a separate box plot for each month. In this case you make the x aesthetic the name of a column containing the vector of month names. The problem is that the month names are column labels rather than a single character vector.

You need to turn the data frame from its native ‘wide’ format to a ‘long’ format. The FLp.df is ‘wide’ because there are separate columns for each month. Wide data are more common because they are convenient for entering data and they let you see more of the data at one time.

head(FLp.df)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1895 3.28   3.24  2.50 4.53   4.25  4.5   7.45  6.10  4.67  3.09 2.65   1.59
## 2  1896 3.93   3.02  2.57 0.498  2.7  11.2   8.22  5.89  4.35  2.96 3.52   2.07
## 3  1897 1.84   6     2.12 4.39   2.28  5.22  7.21  6.83 11.1   4.10 1.75   2.68
## 4  1898 0.704  2.01  1.26 1.32   1.51  3.29  8.95 13.1   5.23  5.88 2.19   3.89
## 5  1899 4.52   5.92  1.90 3.40   1.11  5.80  9.26  6.71  5.13  5.88 0.751  1.94
## 6  1900 3.21   4.37  6.8  4.32   3.89  9.99  7.50  4.49  4.93  5.23 1.22   4.29

You can reduce the number of columns by stacking the rainfall values into a single column and then labeling the rows by month. This preserves all the information from the wide format but does so with fewer columns.

The long data format is less familiar. It corresponds to the relational model for storing data used by databases that are queried with SQL.

Consider the following wide data frame with column names w, x, y, and z.

id  w  x  y  z
 1  A  C  E  G
 2  B  D  F  H

The long data frame version would be

id  name  value
 1  w     A
 1  x     C
 1  y     E
 1  z     G
 2  w     B
 2  x     D
 2  y     F
 2  z     H

You use the pivot_longer() function from the {tidyr} package to turn the wide data frame into a long data frame. Let’s do it and then decipher what happens.

FLpL.df <- FLp.df |>
  tidyr::pivot_longer(cols = -Year, 
                      names_to = "Month",
                      values_to = "Rainfall")

str(FLpL.df)
## tibble [1,404 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Year    : num [1:1404] 1895 1895 1895 1895 1895 ...
##  $ Month   : chr [1:1404] "Jan" "Feb" "Mar" "Apr" ...
##  $ Rainfall: num [1:1404] 3.28 3.24 2.5 4.53 4.25 ...

The cols argument of pivot_longer() specifies the columns to pivot into a longer format. Here you choose them all EXCEPT the one named after the - sign (Year). All the chosen columns hold the same kind of measurement (rainfall in units of inches); Year does not.

The resulting long data frame has the Year variable in the first column and the remaining column names as the name variable in the second column. You change the default name to Month by specifying the names_to = "Month" argument. The third column contains the corresponding rainfall values all in a single column named value. You change this default name by specifying values_to = "Rainfall".
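To see the default column names, here is a minimal sketch that pivots a tiny made-up data frame with columns id, w, x, y, and z. The object names wide.df and long.df are chosen only for this illustration.

```r
# Small wide data frame: one row per id, one column per variable
wide.df <- data.frame(id = c(1, 2),
                      w = c("A", "B"),
                      x = c("C", "D"),
                      y = c("E", "F"),
                      z = c("G", "H"))

# Pivot every column except id; names_to and values_to are left at their defaults
long.df <- wide.df |>
  tidyr::pivot_longer(cols = -id)

names(long.df)   # the default names are "name" and "value"
long.df          # 2 rows x 4 pivoted columns = 8 rows
```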

Note that the column Month is a character vector. When you make a plot using this variable the order will be alphabetical. So you change the variable from a character vector to a factor vector with levels equal to the month abbreviations.

month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
FLpL.df <- FLpL.df |>
  dplyr::mutate(Month = factor(Month, levels = month.abb))
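The effect of supplying levels can be seen on a short made-up vector of month abbreviations. This is a sketch in base R only; the object names m and f are illustrative.

```r
m <- c("Jan", "Feb", "Mar", "Apr")

# Without levels, factor() orders the levels alphabetically
levels(factor(m))

# With levels = month.abb, the order follows the calendar
f <- factor(m, levels = month.abb)
levels(f)[1:4]

# Sorting a factor uses the level order, not the alphabet
sort(f)
```

Plot axes follow the level order, which is why the box plots appear from Jan to Dec after the conversion.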

Note that you can reverse this with the pivot_wider() function.

FLpW.df <- FLpL.df |>
  tidyr::pivot_wider(id_cols = Year,
                     names_from = Month,
                     values_from = Rainfall)

Then to create the box plot, specify the x aesthetic (x-axis) to be Month and the y aesthetic (y-axis) to be Rainfall.

ggplot(data = FLpL.df) + 
  geom_boxplot(mapping = aes(x = Month, y = Rainfall)) +
  ylab("Rainfall (in)")

The graph shows the variation of rainfall by month.

Each geom_ function is a layer. Data for the layer is specified in the function ggplot() with the data frame argument and the aes() function. To add another layer to the plot with different data you specify the data within the geom_ function.

For example, let’s repeat the graph of monthly rainfall highlighting the month of May. First you filter the data frame keeping only rows labeled May and assign this to a new data frame object called May.df.

You then repeat the plot but add another geom_boxplot() layer that includes the argument data = May.df along with the corresponding aes() function. Finally you color the box green.

May.df <- FLpL.df |>
  dplyr::filter(Month == "May")

ggplot(data = FLpL.df, aes(x = Month, y = Rainfall)) + 
  geom_boxplot() +
  ylab("Rainfall (in)") +
  geom_boxplot(data = May.df, 
               mapping = aes(x = Month, y = Rainfall), 
               fill = "green") +
  theme_minimal()

Scatter plots

An important graph is the scatter plot, which shows the relationship between two numeric variables. It plots the values of one variable against the values of the other as points \((x_i, y_i)\) in a Cartesian plane.

For example, to show the relationship between April and September values of rainfall you type

ggplot(FLp.df) + 
  geom_point(mapping = aes(x = Apr, y = Sep)) + 
  xlab("April Rainfall (in)") + 
  ylab("September Rainfall (in)")

The plot shows that dry Aprils tend to be followed by dry Septembers and wet Aprils tend to be followed by wet Septembers.

There is a direct (or positive) relationship between the two variables although the points are scattered widely indicating the relationship is loose.
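One common way to put a number on a 'direct but loose' relationship is the Pearson correlation from cor(). The sketch below uses made-up April and September values, not the Florida data, so the result is only illustrative.

```r
# Hypothetical rainfall values (inches) for six years
apr <- c(1, 2, 3, 4, 5, 6)
sep <- c(5, 3, 6, 4, 8, 5)

r <- cor(apr, sep)   # Pearson correlation by default
r                    # positive (direct) but well below 1 (loose)
```

With the real data you could try cor(FLp.df$Apr, FLp.df$Sep), adding use = "complete.obs" if either column contains missing values.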

If your goal is to model the relationship, you plot the dependent variable (the variable you are interested in modeling) on the vertical axis.

Here you put the September values on the vertical axis since a predictive model would use April values to predict September values because April comes before September in the calendar year.

If the points have a natural ordering then you use the geom_line() function. For example, to plot the September rainfall values as a time series, type

ggplot(FLp.df) + 
  geom_line(mapping = aes(x = Year, y = Sep)) + 
  xlab("Year") + 
  ylab("September Rainfall (in)")

Rainfall values fluctuate from one September to the next, but there does not appear to be a long-term trend. With time series data it is better to connect the values with lines rather than use points unless values are missing.

Create a plot of the May values of the North Atlantic Oscillation (NAO) with Year on the horizontal axis. Add appropriate axis labels.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/NAO.txt"
NAO.df <- readr::read_table(file = loc)
ggplot(NAO.df, aes(x = Year, y = May)) + 
  geom_line() + 
  xlab("Year") + 
  ylab("North Atlantic Oscillation (s.d.)")

Let’s return to the mpg data frame. The data frame describes different automobiles by manufacturer, model, engine size, mileage, class, etc.

names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"

Let’s start with a scatter plot showing highway mileage on the vertical axis and engine size on the horizontal axis.

ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), 
             color = "blue")

You add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in our plot. Aesthetics include things like the size, the shape, or the color of our points. You can display a point in different ways by changing the levels of its aesthetic properties (e.g., changing the level by size, color, type).

For example, you map the colors of our points to the class variable to reveal the class of each car.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, 
                           y = hwy, 
                           color = class))

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). Note that in the earlier plot color = "blue" was specified outside aes(), which sets every point to the same color rather than mapping color to a variable.
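The difference between setting an aesthetic and mapping it can be checked on the plot objects themselves. This sketch uses layer_data() from {ggplot2} to look at the colors actually assigned to the points. The object names p_set and p_map are illustrative.

```r
library(ggplot2)

# Setting: color is a constant, given outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: color depends on a variable, given inside aes()
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# layer_data() returns the values computed for a layer
unique(layer_data(p_set)$colour)           # a single color for all points
length(unique(layer_data(p_map)$colour))   # one color per class
```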

ggplot() will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot() will also add a legend that explains which levels correspond to which values.

The colors show that many of the unusual points are two-seater cars. Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage.

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split our plot into facets, subplots that each display one subset of the data.

To facet a plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ (tilde) followed by a variable name (here ‘formula’ is the name of a data structure in R, not a synonym for ‘equation’). The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

To facet a plot on the combination of two variables, add facet_grid() to the plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~ with the first variable named varying in the vertical direction and the second varying in the horizontal direction.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Here drv refers to the drive train: front-wheel (f), rear-wheel (r) or 4-wheel (4).

Example: Palmer penguins

Let’s return to the penguins data set. You import it as a data frame using the readr::read_csv() function.

loc <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins <- readr::read_csv(loc)
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(penguins)
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <dbl>

Here you will visualize the relationship between flipper_length_mm and body_mass_g with respect to each species.

https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95

Start by creating a scatter plot with flipper length on the horizontal axis and body mass on the vertical axis.

ggplot(data = penguins) +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g))
## Warning: Removed 2 rows containing missing values (geom_point).

Next, make the color and shape of the points correspond to the species type. Use the colors “darkorange,” “purple,” “cyan4.”

ggplot(data = penguins) +
  geom_point(aes(x = flipper_length_mm, 
                 y = body_mass_g, 
                 color = species,
                 shape = species)) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4"))
## Warning: Removed 2 rows containing missing values (geom_point).

Finally, separate the scatter plots by island.

ggplot(data = penguins) +
  geom_point(aes(x = flipper_length_mm, 
                 y = body_mass_g, 
                 color = species,
                 shape = species)) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  facet_wrap(~ island)
## Warning: Removed 2 rows containing missing values (geom_point).

An expository graph

Adding labels and titles turns an exploratory graph into an expository graph. Consider again the mpg dataset and plot highway mileage (hwy) as a function of engine size (displ) with the color of the point layer given by automobile class (class).

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The graph title should summarize the main finding. Avoid titles that just describe what the plot is, e.g. “A scatter plot of engine displacement vs. fuel economy.” If you need to add more text use subtitles and captions.

  • subtitle = adds additional detail in a smaller font beneath the title.
  • caption = adds text at the bottom right of the plot, often used to describe the source of the data.
ggplot(data = mpg, 
       mapping = aes(displ, hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size",
       subtitle = "Two seaters (sports cars) are an exception because of their light weight",
       caption = "Data are from fueleconomy.gov")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exporting your graph

When you knit to HTML and a plot is produced it gets output as a png file in our project directory.

You can use the Export button under the Plots tab.

Or you can export the file directly using R code. Here the file gets put into our working directory.

png(file = "Test.png")
p1
dev.off()

Note that the function png() opens the device and the function dev.off() closes it.

You list the files in your working directory with the command dir().
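Another option for ggplot objects is ggsave() from {ggplot2}, which opens and closes the device for you. The sketch below writes to a temporary folder so that it runs anywhere. The file name, size, and resolution are choices made for this example.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Write to a temporary folder here; use a path in your project in practice
out_file <- file.path(tempdir(), "Test.png")
ggsave(filename = out_file, plot = p,
       width = 6, height = 4, units = "in", dpi = 150)

file.exists(out_file)   # check that the file was written
```

Setting width, height, and dpi explicitly makes the exported figure the same size regardless of the size of the Plots pane.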

Also check out the {ggdist} package for more ways to visualize distributions.

Thursday, September 21, 2022

Today

  • Making maps

Simple feature data frames

Geographic visualization of data is important to geographers and environmental scientists. There are many tools for geo visualization, from full-scale GIS applications such as ArcGIS and QGIS to web-based tools like Google Maps.

Using code to make maps (instead of point and click) has the benefit of transparency and reproducibility.

Simple features (simple feature access) refers to a standard that describes how objects in the real world are represented in computers. Emphasis is on the spatial geometry of the objects.

The standard also describes how such objects are stored in and retrieved from databases, and which geometrical operations are defined for them.

The simple feature standard is implemented in spatial databases (such as PostGIS) and commercial GIS (e.g., ESRI ArcGIS). R has an implementation in the {sf} package.

One type of spatial data file is called a shapefile. As an example, U.S. census information at the state and territory level is available in a file called cb_2018_us_state_5m.shp from https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

A shapefile encodes points, lines, and polygons in geographic space, and is actually a set of files. Shapefiles appear with a .shp extension and with accompanying files ending in .dbf and .prj.

  • .shp stores the geographic coordinates of the geographic features (e.g. country, state, county)
  • .dbf stores data associated with the geographic features (e.g. unemployment rates)
  • .prj stores information about the projection of the coordinates in the shapefile

To get a shapefile into R all the files need to be in the same folder (directory).
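To see that a shapefile really is a set of files, the following sketch writes a one-point simple feature data frame to a temporary folder with sf::st_write() and lists the result. The point name, coordinates, and file name are made up for illustration.

```r
# A one-point simple feature data frame (made-up name and coordinates)
pt.sf <- sf::st_sf(name = "A",
                   geometry = sf::st_sfc(sf::st_point(c(-84.3, 30.4)),
                                         crs = 4326))

# Write it as a shapefile into a fresh temporary folder
out_dir <- tempfile("shp_")
dir.create(out_dir)
sf::st_write(pt.sf, dsn = file.path(out_dir, "pt.shp"), quiet = TRUE)

list.files(out_dir)   # several files sharing the base name "pt"
```

The listing includes the .shp, .dbf, and .prj files described above, along with an index file ending in .shx.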

As an example, you import the census data with the sf::st_read() function from the {sf} package. You assign to the object USA.sf the contents of the spatial data frame.

USA.sf <- sf::st_read(dsn = "data/cb_2018_us_state_5m")
## Reading layer `cb_2018_us_state_5m' from data source 
##   `/Users/jameselsner/Desktop/ClassNotes/QG-2022/data/cb_2018_us_state_5m' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1473 ymin: -14.55255 xmax: 179.7785 ymax: 71.35256
## Geodetic CRS:  NAD83

The output includes information about the file. The object shows up in our environment as a data frame with 56 observations and 10 variables.

Each observation is either a state or territory.

The class() function tells us the type of data frame and the names() function lists the variable names.

class(USA.sf)
## [1] "sf"         "data.frame"
names(USA.sf)
##  [1] "STATEFP"  "STATENS"  "AFFGEOID" "GEOID"    "STUSPS"   "NAME"    
##  [7] "LSAD"     "ALAND"    "AWATER"   "geometry"

The file is a simple feature (sf) data frame (data.frame). This means it behaves like a data frame but it also contains information about where the observations are located.

The first several columns serve as identifiers. The variable ALAND is the land area (square meters) and the AWATER is the water area (sq. m).

The last column labeled geometry contains information about location stored as a ‘feature.’ The function sf::st_geometry() prints a summary of the geometry column along with the first 5 geometries.

sf::st_geometry(USA.sf)
## Geometry set for 56 features 
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1473 ymin: -14.55255 xmax: 179.7785 ymax: 71.35256
## Geodetic CRS:  NAD83
## First 5 geometries:
## MULTIPOLYGON (((-104.0535 41.15726, -104.0527 4...
## MULTIPOLYGON (((-122.3283 48.02134, -122.3217 4...
## MULTIPOLYGON (((-109.0502 31.48, -109.0498 31.4...
## MULTIPOLYGON (((-104.0577 44.99743, -104.0502 4...
## MULTIPOLYGON (((-106.6455 31.89867, -106.6408 3...

The geometry type in this case is MULTIPOLYGON.

A feature is an object in the real world. Often features will consist of a set of features. For instance, a tree is a feature but a set of trees in a forest is itself a feature. The trees are represented as points while the forest boundary as a polygon.

Features have a geometry describing where on Earth the feature is located. They also have attributes, which describe other properties of the feature.
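The tree and forest example can be sketched with functions from the {sf} package. All coordinates, attribute values, and object names below are made up, and no coordinate reference system is assigned.

```r
# Three trees as POINT features with one attribute (species)
trees.sf <- sf::st_sf(
  species = c("oak", "pine", "oak"),
  geometry = sf::st_sfc(sf::st_point(c(1, 1)),
                        sf::st_point(c(2, 3)),
                        sf::st_point(c(3, 2)))
)

# The forest boundary as a single POLYGON feature
# (the first and last vertex must match to close the ring)
boundary <- rbind(c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0))
forest.sf <- sf::st_sf(
  name = "forest",
  geometry = sf::st_sfc(sf::st_polygon(list(boundary)))
)

sf::st_geometry_type(trees.sf)   # geometry: where the features are
sf::st_drop_geometry(trees.sf)   # attributes: other properties of the features
sf::st_geometry_type(forest.sf)
```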

More on spatial data in a few weeks.

Making a boundary map

The functions in the {ggplot2} package work with simple feature data frames to generate maps using the same grammar.

The important function is geom_sf(). This function draws the geometries.

For example, to draw a map showing the state and territorial boundaries first use ggplot() with the data argument specifying the simple feature data frame USA.sf and then add the geom_sf() function as a layer with the + symbol.

library(ggplot2)

ggplot(data = USA.sf) +
  geom_sf()

Note: you don’t need the mapping = aes() function. The mapping is assumed based on the fact that there is a geometry column in the simple feature data frame.

The geom_sf() function maps the east-west coordinate to the x aesthetic and the north-south coordinate to the y aesthetic.

The map is not very informative. Let’s zoom into the contiguous states.

What states/territories are there in the data frame USA.sf?

USA.sf$NAME
##  [1] "Nebraska"                                    
##  [2] "Washington"                                  
##  [3] "New Mexico"                                  
##  [4] "South Dakota"                                
##  [5] "Texas"                                       
##  [6] "California"                                  
##  [7] "Kentucky"                                    
##  [8] "Ohio"                                        
##  [9] "Alabama"                                     
## [10] "Georgia"                                     
## [11] "Wisconsin"                                   
## [12] "Oregon"                                      
## [13] "Pennsylvania"                                
## [14] "Mississippi"                                 
## [15] "Missouri"                                    
## [16] "North Carolina"                              
## [17] "Oklahoma"                                    
## [18] "West Virginia"                               
## [19] "New York"                                    
## [20] "Indiana"                                     
## [21] "Kansas"                                      
## [22] "Idaho"                                       
## [23] "Nevada"                                      
## [24] "Vermont"                                     
## [25] "Montana"                                     
## [26] "Minnesota"                                   
## [27] "North Dakota"                                
## [28] "Hawaii"                                      
## [29] "Arizona"                                     
## [30] "Delaware"                                    
## [31] "Rhode Island"                                
## [32] "Colorado"                                    
## [33] "Utah"                                        
## [34] "Virginia"                                    
## [35] "Wyoming"                                     
## [36] "Louisiana"                                   
## [37] "Michigan"                                    
## [38] "Massachusetts"                               
## [39] "Florida"                                     
## [40] "United States Virgin Islands"                
## [41] "Connecticut"                                 
## [42] "New Jersey"                                  
## [43] "Maryland"                                    
## [44] "South Carolina"                              
## [45] "Maine"                                       
## [46] "New Hampshire"                               
## [47] "District of Columbia"                        
## [48] "Guam"                                        
## [49] "Commonwealth of the Northern Mariana Islands"
## [50] "American Samoa"                              
## [51] "Iowa"                                        
## [52] "Puerto Rico"                                 
## [53] "Arkansas"                                    
## [54] "Tennessee"                                   
## [55] "Illinois"                                    
## [56] "Alaska"

To zoom in you keep only rows corresponding to states (in the lower 48) from the simple feature data frame.

Recall to pick out rows in a data frame you use the dplyr::filter() function from the {dplyr} package.

First you need to get a list of all the states you want to keep. The state.name vector object contains all 50 state names. This is like the month.abb vector you saw earlier.

state.name
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Let’s remove the rows corresponding to the names "Alaska" and "Hawaii". These are elements 2 and 11 so you create a new vector object called sn containing only the names of the lower 48.

sn <- state.name[c(-2, -11)]
sn
##  [1] "Alabama"        "Arizona"        "Arkansas"       "California"    
##  [5] "Colorado"       "Connecticut"    "Delaware"       "Florida"       
##  [9] "Georgia"        "Idaho"          "Illinois"       "Indiana"       
## [13] "Iowa"           "Kansas"         "Kentucky"       "Louisiana"     
## [17] "Maine"          "Maryland"       "Massachusetts"  "Michigan"      
## [21] "Minnesota"      "Mississippi"    "Missouri"       "Montana"       
## [25] "Nebraska"       "Nevada"         "New Hampshire"  "New Jersey"    
## [29] "New Mexico"     "New York"       "North Carolina" "North Dakota"  
## [33] "Ohio"           "Oklahoma"       "Oregon"         "Pennsylvania"  
## [37] "Rhode Island"   "South Carolina" "South Dakota"   "Tennessee"     
## [41] "Texas"          "Utah"           "Vermont"        "Virginia"      
## [45] "Washington"     "West Virginia"  "Wisconsin"      "Wyoming"

Now you filter the USA.sf data frame keeping only the rows that are listed in the vector of state names. Assign this spatial data frame the name USA_48.sf.

USA_48.sf <- USA.sf |>
  dplyr::filter(NAME %in% sn)

The %in% operator returns TRUE for each row in USA.sf whose NAME matches one of the names in the vector sn, and the dplyr::filter() function keeps only these rows.
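A small base R sketch shows what %in% returns. It also shows how the same operator can drop states by name rather than by position, which avoids having to look up that Alaska and Hawaii are elements 2 and 11. The name sn2 is illustrative.

```r
# %in% returns one TRUE/FALSE per element on its left-hand side
c("Alaska", "Texas", "Guam") %in% state.name

# Drop states by name instead of by position
sn2 <- state.name[!(state.name %in% c("Alaska", "Hawaii"))]
length(sn2)

# Same result as negative indexing by position
identical(sn2, state.name[c(-2, -11)])
```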

Now redraw the map using the USA_48.sf simple feature data frame.

ggplot(data = USA_48.sf) +
  geom_sf()

Since the map is a ggplot() object, it is modified like any other ggplot() graph. For example, you change the color of the map and the borders as follows.

ggplot(data = USA_48.sf) +
  geom_sf(fill = "skyblue", 
          color = "gray70")

You can filter by state. Here you create a new simple feature data frame called Wisconsin.sf then draw the boundary.

Wisconsin.sf <- USA_48.sf |>
  dplyr::filter(NAME == "Wisconsin")

ggplot(data = Wisconsin.sf) +
  geom_sf(fill = "palegreen", 
          color = "black")

Where is the state of Nebraska? Repeat but fill in Nebraska using the color brown.

Nebraska.sf <- USA_48.sf |>
  dplyr::filter(NAME == "Nebraska")

ggplot(data = USA_48.sf) +
  geom_sf() +
  geom_sf(data = Nebraska.sf, 
          fill = "brown")

You add layers with the + symbol as before.

Boundaries serve as the background canvas for spatial data analysis. You usually need to add data to this canvas. Depending on the type of data, you either overlay it on top of the boundaries or use it to fill in the areas between the boundaries.

Fills

Choropleth maps (heat maps, thematic maps) map data values from a column in the simple feature data frame to the fill aesthetic. The aesthetic assigns colors to the various map areas (e.g. countries, states, counties, zip codes).

Recall the column labeled AWATER contains the water area in square meters. Since the values are very large, first divide by one billion (10^9) to get the values in 1000s of square kilometers. This is done with the mutate() function. Note that, despite its name, the new column WaterArea_km2 holds values in thousands of square kilometers.

USA_48.sf <- USA_48.sf |>
  dplyr::mutate(WaterArea_km2 = AWATER/10^9)
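A quick unit check may help here. The sketch below uses an assumed, illustrative water area of 1.37e9 square meters; the object names are made up for the example.

```r
# Illustrative value in square meters (assumed for this check)
awater_m2 <- 1.37e9

# 1 km^2 = 10^6 m^2, so dividing by 10^6 gives square kilometers
awater_km2 <- awater_m2 / 10^6
awater_km2

# Dividing by 10^9 gives thousands of square kilometers
awater_1000km2 <- awater_m2 / 10^9
awater_1000km2
```

So the values stored in the column WaterArea_km2 are in thousands of square kilometers.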

Then create a choropleth map showing the water area by filling the area between the state borders with a color. This is done using the aes() function and the argument fill = WaterArea_km2.

ggplot(data = USA_48.sf) +
  geom_sf(aes(fill = WaterArea_km2))

Note how this differs from just drawing the boundaries. In this case you use the aes() function with the fill aesthetic.

The map is not very informative. Michigan, whose area includes parts of Lakes Michigan, Superior, and Huron, has by far the most water area, and most other states have a lot less.

To change that use the logarithm of the area. The base 10 logarithm is 0 when the value is 1, 1 when the value is 10, 2 when the value is 100 and so on. This is seen with the log10() function.

log10(c(1, 10, 100, 1000, 10000))
## [1] 0 1 2 3 4

You convert the area to logarithms with the log10() function inside the aes() function as follows.

ggplot(data = USA_48.sf) +
  geom_sf(aes(fill = log10(WaterArea_km2))) 

Another way to make the map more informative is to convert the continuous variable to a discrete variable and map the discrete values.

In the {ggplot2} package the cut_interval() function takes a continuous variable and makes n groups each having an equal range; cut_number() makes n groups with (approximately) equal numbers of observations; cut_width() makes groups of a specified width.
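The difference between the cut functions is easiest to see on a small skewed vector. This is a sketch with made-up numbers; the functions are called from {ggplot2}, which exports them.

```r
library(ggplot2)

# Made-up values with one large outlier, similar in spirit to the water areas
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Two groups covering equal ranges: the outlier sits alone in the upper group
interval_counts <- table(cut_interval(x, n = 2))
interval_counts

# Two groups with equal numbers of observations
number_counts <- table(cut_number(x, n = 2))
number_counts

# Groups that are each 50 units wide
table(cut_width(x, width = 50))
```

With skewed data like this, cut_number() spreads the observations evenly across the groups, which is why it tends to give a more readable choropleth.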

As an example, if you want a map with 5 colors, with each color representing about the same number of states, you use cut_number() and specify n = 5. You do this with the mutate() function to create a new variable (column) called WaterArea_cut.

USA_48.sf <- USA_48.sf |>
  dplyr::mutate(WaterArea_cut = cut_number(WaterArea_km2, n = 5))
str(USA_48.sf)
## Classes 'sf' and 'data.frame':   48 obs. of  12 variables:
##  $ STATEFP      : chr  "31" "53" "35" "46" ...
##  $ STATENS      : chr  "01779792" "01779804" "00897535" "01785534" ...
##  $ AFFGEOID     : chr  "0400000US31" "0400000US53" "0400000US35" "0400000US46" ...
##  $ GEOID        : chr  "31" "53" "35" "46" ...
##  $ STUSPS       : chr  "NE" "WA" "NM" "SD" ...
##  $ NAME         : chr  "Nebraska" "Washington" "New Mexico" "South Dakota" ...
##  $ LSAD         : chr  "00" "00" "00" "00" ...
##  $ ALAND        : num  1.99e+11 1.72e+11 3.14e+11 1.96e+11 6.77e+11 ...
##  $ AWATER       : num  1.37e+09 1.26e+10 7.29e+08 3.38e+09 1.90e+10 ...
##  $ geometry     :sfc_MULTIPOLYGON of length 48; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:1516, 1:2] -104 -104 -104 -104 -104 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ WaterArea_km2: num  1.372 12.559 0.729 3.383 19.006 ...
##  $ WaterArea_cut: Factor w/ 5 levels "[0.489,1.38]",..: 1 5 1 3 5 5 2 4 3 3 ...
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:11] "STATEFP" "STATENS" "AFFGEOID" "GEOID" ...

Essentially you added a new factor variable called WaterArea_cut with five levels, each containing about the same number of states.

You can go directly to the mapping as follows.

ggplot(data = USA_48.sf) +
  geom_sf(aes(fill = WaterArea_cut))

Make a choropleth map displaying the ratio of water area to land area (ALAND) by state.

ggplot(data = USA_48.sf) +
  geom_sf(aes(fill = AWATER/ALAND * 100))

Overlays

The USA_48.sf simple feature data frame uses longitude and latitude for its coordinate reference system (CRS). All spatial data frames have a CRS.

To see what CRS a simple feature data frame has, use the sf::st_crs() function.

sf::st_crs(USA_48.sf)
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

The Coordinate Reference System information, including the EPSG code (4269) and the corresponding GEOGCRS, DATUM, etc., is given in well-known text (wkt).

Here it specifies a geographic reference system with longitude and latitude and a datum (North American 1983) that describes the sea-level shape of the planet as an ellipsoid.

Because the CRS uses longitude and latitude you can add locations by specifying the geographic coordinates.

For example, suppose you want to overlay the locations of two cities on the map. First you create a data frame containing the longitudes, latitudes, and names of the locations.

Cities.df <- data.frame(long = c(-84.2809, -87.9735),
                        lat = c(30.4381,43.0115),
                        names = c("Tallahassee", "Milwaukee"))
class(Cities.df)
## [1] "data.frame"

Next you draw the map as before but add the locations with a point layer and label the locations with a text layer.

ggplot(data = USA_48.sf) +
  geom_sf(color = "gray80") +
  geom_point(data = Cities.df, 
             mapping = aes(x = long, y = lat), 
             size = 2) +
  geom_text(data = Cities.df,
            mapping = aes(x = long, y = lat, label = names),
            nudge_y = 1)

As another example, let’s consider the airports data frame from the {nycflights13} package. The data frame includes information on 1458 airports in the United States including their location with latitude and longitude.

library(nycflights13)
airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows

Each row is an airport and the location of the airport is given in the columns lat and lon. You can make a map without boundaries by drawing a scatter plot with x = lon and y = lat.

ggplot(data = airports, 
       mapping = aes(x = lon, y = lat)) +
  geom_point()

If you only want airports within the continental United States, you first plot the USA_48.sf boundaries, then add the airport locations as a separate point layer, and then use the coord_sf() function to specify the limits of the plot in the longitude direction (xlim) and the latitude direction (ylim).

ggplot(data = USA_48.sf) + 
  geom_sf(color = "gray80") + 
  geom_point(data = airports, 
             aes(x = lon, y = lat)) +
  coord_sf(xlim = c(-130, -60),
           ylim = c(20, 50)) +
  theme_minimal()

Alternatively, you can use sf::st_as_sf() to convert the airports data frame to a simple features data frame. The argument coords = tells sf::st_as_sf() which columns contain the geographic coordinates of each airport. You also set the CRS with the crs = argument, using the EPSG code corresponding to a geographic CRS.

airports.sf <- sf::st_as_sf(airports, 
                        coords = c("lon", "lat"),
                        crs = 4269)
airports.sf
## Simple feature collection with 1458 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -176.646 ymin: 19.72137 xmax: 174.1136 ymax: 72.27083
## Geodetic CRS:  NAD83
## # A tibble: 1,458 × 7
##    faa   name                    alt    tz dst   tzone             geometry
##  * <chr> <chr>                 <dbl> <dbl> <chr> <chr>          <POINT [°]>
##  1 04G   Lansdowne Airport      1044    -5 A     Amer… (-80.61958 41.13047)
##  2 06A   Moton Field Municipa…   264    -6 A     Amer… (-85.68003 32.46057)
##  3 06C   Schaumburg Regional     801    -6 A     Amer… (-88.10124 41.98934)
##  4 06N   Randall Airport         523    -5 A     Amer… (-74.39156 41.43191)
##  5 09J   Jekyll Island Airport    11    -5 A     Amer… (-81.42778 31.07447)
##  6 0A9   Elizabethton Municip…  1593    -5 A     Amer… (-82.17342 36.37122)
##  7 0G6   Williams County Airp…   730    -5 A     Amer… (-84.50678 41.46731)
##  8 0G7   Finger Lakes Regiona…   492    -5 A     Amer… (-76.78123 42.88356)
##  9 0P2   Shoestring Aviation …  1000    -5 U     Amer… (-76.64719 39.79482)
## 10 0S9   Jefferson County Intl   108    -8 A     Amer… (-122.8106 48.05381)
## # … with 1,448 more rows

To graph the points on the map, you use a second geom_sf().

ggplot() + 
  geom_sf(data = USA_48.sf) + 
  geom_sf(data = airports.sf, shape = 1) +
  coord_sf(xlim = c(-130, -60),
           ylim = c(20, 50))

You can change the size or type of symbols on the map. For instance, you can draw a bubble plot (also known as a proportional symbol map) and encode the altitude of the airport through the size = aesthetic.

ggplot() +
  geom_sf(data = USA_48.sf) +
  geom_sf(data = airports.sf, aes(size = alt),
          fill = "grey", color = "black", alpha = .2) +
  coord_sf(xlim = c(-130, -60),
           ylim = c(20, 50)) +
  scale_size_area(guide = "none")

Circle area is proportional to the airport’s altitude (in feet).

Map projections

Depending on how a curved surface is projected onto a 2-D surface (map), at least some features will be distorted. The coord_sf() function provides a way to adjust projections.

With a geographic projection the longitudes and latitudes are treated as x (horizontal) and y (vertical) coordinates.

Consider again the boundary map of the lower 48 states. Here you get the boundary file using the us_states() function from the {USAboundaries} package and use the dplyr::filter() function to remove rows corresponding to Hawaii, Alaska, and Puerto Rico.

USA_48.sf <- USAboundaries::us_states() |>
   dplyr::filter(!state_name %in% c("Hawaii", "Alaska", "Puerto Rico"))

Here you first assign the map to an object called base_map and then render the map to the plot device by typing the object name.

base_map <- ggplot(data = USA_48.sf) +
              geom_sf()
base_map

Note the equal spacing between the latitudes and between the longitudes. 1 degree latitude distance equals 1 degree longitude distance. This is called a carto-cartesian (geographic) projection.

You change the projection by specifying the CRS. For example, to give the base map a Mercator projection you use the coord_sf() function with crs = "+proj=merc". A close alternative is crs = 3857, the EPSG code for the Web Mercator (Pseudo-Mercator) projection used by many web maps.

base_map +
  coord_sf(crs = "+proj=merc") +
  ggtitle("Mercator projection")

base_map +
  coord_sf(crs = 3857) +
  ggtitle("Mercator projection")

Note the distance between the latitudes increases with increasing latitude. Note also the projection is applied to the rendered map and not the simple feature data frame used to create it.

The Mercator projection is widely used, but it makes areas closer to the poles appear larger than the same areas closer to the equator. Greenland appears as large as the continent of Africa. In reality Africa is 14 times larger in area than Greenland.
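The size of this distortion can be worked out directly. For a spherical Mercator projection the linear scale at latitude φ is 1/cos(φ), so areas are inflated by 1/cos²(φ) relative to the equator. A small sketch (the latitudes are arbitrary choices):

```r
# Latitudes in degrees (arbitrary illustrative values)
lat <- c(0, 30, 45, 60, 75)

# Area inflation factor of a spherical Mercator map relative to the equator
area_inflation <- 1 / cos(lat * pi / 180)^2

data.frame(lat = lat, area_inflation = round(area_inflation, 2))
```

At 60 degrees latitude areas are drawn four times too large, and the factor grows quickly toward the poles.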

Other coordinate systems require specification of the standard lines, or lines that define areas of the surface of the map that are tangent to the globe. These include Gall-Peters, Albers equal-area, and Lambert azimuthal.

base_map +
  coord_sf(crs = "+proj=cea +lon_0=0 +lat_ts=45") +
  ggtitle("Gall-Peters projection")

With this projection states having the same area appear with the same size, but the boundary shapes are distorted.

With the Albers equal-area projection, distortions are smallest between the two standard parallels, set here with lat_1 and lat_2.

base_map +
  coord_sf(crs = "+proj=aea +lat_1=25 +lat_2=50 +lon_0=-100") +
  ggtitle("Albers equal-area projection")

USA Contiguous Albers Equal Area Conic, USGS (EPSG = 5070 or 102003)

See Kyle Walker’s get CRS. See also the {maptiles} package: https://github.com/riatelab/maptiles/

Why map projections matter. Clip from The West Wing. https://youtu.be/vVX-PrBRtTY

Thursday, September 21, 2022

Today

  • Making maps

Maps with tmap

The {tmap} package has functions for creating thematic maps. The syntax is like the syntax of the functions in {ggplot2}. The functions work with a variety of spatial data.

Consider the simple feature data frame called World from the {tmap} package.

library(tmap)

data("World")
str(World)
## Classes 'sf' and 'data.frame':   177 obs. of  16 variables:
##  $ iso_a3      : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
##  $ sovereignt  : Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
##  $ continent   : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
##  $ area        : Units: [km^2] num  652860 1246700 27400 71252 2736690 ...
##  $ pop_est     : num  28400000 12799293 3639453 4798491 40913584 ...
##  $ pop_est_dens: num  43.5 10.3 132.8 67.3 15 ...
##  $ economy     : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
##  $ income_grp  : Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
##  $ gdp_cap_est : num  784 8618 5993 38408 14027 ...
##  $ life_exp    : num  59.7 NA 77.3 NA 75.9 ...
##  $ well_being  : num  3.8 NA 5.5 NA 6.5 4.3 NA NA 7.2 7.4 ...
##  $ footprint   : num  0.79 NA 2.21 NA 3.14 2.23 NA NA 9.31 6.06 ...
##  $ inequality  : num  0.427 NA 0.165 NA 0.164 ...
##  $ HPI         : num  20.2 NA 36.8 NA 35.2 ...
##  $ geometry    :sfc_MULTIPOLYGON of length 177; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:69, 1:2] 61.2 62.2 63 63.2 64 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:15] "iso_a3" "name" "sovereignt" "continent" ...

The spatial data frame contains socioeconomic indicators from 177 countries around the world. Each row is one country’s indicators.

You make a map by first specifying the spatial data frame using the tm_shape() function and then you add a layer consistent with the geometry.

For example, if you want a map showing the index of happiness (column name HPI) by country, use the tm_shape() function to identify the spatial data frame World then add a fill layer with the tm_polygons() function.

The fill is specified by the argument col = indicating the specific column from the data frame. Here use HPI.

tm_shape(shp = World) +
    tm_polygons(col = "HPI")

The tm_polygons() function with the argument col = colors the countries based on the values in the column HPI of the World data frame.

Map layers are added with the + operator.

Caution: the column in the data frame World must be specified using quotes "HPI". This is different from the functions in the {ggplot2} package.

To show two thematic maps together, each with a different variable, specify col = c("HPI", "well_being").
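A sketch of the two-map call follows. It assumes the {tmap} version 3 syntax used in these notes, where col = names the column(s) to map; newer versions of {tmap} may print a message suggesting a different argument name. The object name two_maps is made up for the example.

```r
library(tmap)

data("World")

# One map per column named in col =
two_maps <- tm_shape(shp = World) +
  tm_polygons(col = c("HPI", "well_being"))

# Typing the object name draws the maps
two_maps
```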

The tm_polygons() function splits the values in the specified column into meaningful groups (here 8), and countries with missing values (NA) are colored gray.

More (or fewer) intervals can be specified with the n = argument, but the cutoff values are chosen at appropriate places.

Tornado data

Consider the tornado data from the U.S. Storm Prediction Center (SPC). It is downloaded as a shapefile in the directory data/1950-2018-torn-aspath.

A shapefile is imported with the sf::st_read() function from the {sf} package.

Tornadoes.sf <- sf::st_read(dsn = "data/1950-2018-torn-aspath")
## Reading layer `1950-2018-torn-aspath' from data source 
##   `/Users/jameselsner/Desktop/ClassNotes/QG-2022/data/1950-2018-torn-aspath' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 63645 features and 22 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -163.53 ymin: 18.13 xmax: -64.9 ymax: 61.02
## Geodetic CRS:  WGS 84

The resulting object is a simple feature data frame with 63645 features (observations) and 22 fields (variables), plus a geometry column.

Each row (observation) is a unique tornado.

Look inside the simple feature data frame with the glimpse() function from the {dplyr} package.

dplyr::glimpse(Tornadoes.sf)
## Rows: 63,645
## Columns: 23
## $ om       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ yr       <dbl> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1…
## $ mo       <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ dy       <dbl> 3, 3, 3, 13, 25, 25, 26, 11, 11, 11, 11, 12, 12, 12, 12, 12, …
## $ date     <chr> "1950-01-03", "1950-01-03", "1950-01-03", "1950-01-13", "1950…
## $ time     <chr> "11:00:00", "11:55:00", "16:00:00", "05:25:00", "19:30:00", "…
## $ tz       <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ st       <chr> "MO", "IL", "OH", "AR", "MO", "IL", "TX", "TX", "TX", "TX", "…
## $ stf      <dbl> 29, 17, 39, 5, 29, 17, 48, 48, 48, 48, 48, 48, 48, 48, 48, 28…
## $ stn      <dbl> 1, 2, 1, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 10, 2, 1, …
## $ mag      <dbl> 3, 3, 1, 3, 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 2, 1, 3, 2, 4, 2…
## $ inj      <dbl> 3, 3, 1, 1, 5, 0, 2, 0, 12, 5, 6, 8, 0, 0, 32, 2, 0, 15, 0, 7…
## $ fat      <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, 0, 3, 0, 18, …
## $ loss     <dbl> 6, 5, 4, 3, 5, 5, 0, 4, 4, 5, 5, 4, 4, 4, 5, 4, 0, 5, 3, 5, 5…
## $ closs    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ slat     <dbl> 38.77, 39.10, 40.88, 34.40, 37.60, 41.17, 26.88, 29.42, 29.67…
## $ slon     <dbl> -90.22, -89.30, -84.58, -94.37, -90.68, -87.33, -98.12, -95.2…
## $ elat     <dbl> 38.8300, 39.1200, 40.8801, 34.4001, 37.6300, 41.1701, 26.8800…
## $ elon     <dbl> -90.0300, -89.2300, -84.5799, -94.3699, -90.6500, -87.3299, -…
## $ len      <dbl> 9.5, 3.6, 0.1, 0.6, 2.3, 0.1, 4.7, 9.9, 12.0, 4.6, 4.5, 8.0, …
## $ wid      <dbl> 150, 130, 10, 17, 300, 100, 133, 400, 1000, 100, 67, 833, 233…
## $ fc       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ geometry <LINESTRING [°]> LINESTRING (-90.22 38.77, -..., LINESTRING (-89.3 …

The first 22 columns are variables (attributes). The last column contains the geometry. Information in the geometry column is in well-known text (WKT) format.

Each tornado is coded as a LINESTRING with a start and end location. This is where the tm_shape() function looks for the geographic information.
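To see what such a geometry looks like, the sketch below builds a LINESTRING by hand from the start and end coordinates of the first tornado in the glimpse() output (slon, slat, elon, elat). The object names are made up for the example.

```r
library(sf)

# Start and end points (longitude, latitude) of the first tornado track
start_pt <- c(-90.22, 38.77)
end_pt <- c(-90.03, 38.83)

# A LINESTRING is built from a matrix with one row per vertex
track <- st_linestring(matrix(c(start_pt, end_pt), ncol = 2, byrow = TRUE))
st_as_text(track)

# Attach the CRS (WGS 84) to make a geometry column, then compute the length
track.sfc <- st_sfc(track, crs = 4326)
st_length(track.sfc)
```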

Here you make a map showing the tracks of all the tornadoes since 2011. First filter the data frame keeping only tornadoes occurring after the year (yr) 2010.

TornadoesSince2011.sf <- 
  Tornadoes.sf |>
  dplyr::filter(yr >= 2011) 

Next get a file containing the boundaries of the lower 48 states.

USA_48.sf <- USAboundaries::us_states() |>
   dplyr::filter(!state_name %in% c("Hawaii", "Alaska", "Puerto Rico"))

Then use the tm_shape() function together with the tm_borders() layer to draw the boundaries before adding the tornadoes. The tornadoes are in a separate spatial data frame so you use the tm_shape() function together with the tm_lines() layer.

tm_shape(shp = USA_48.sf, projection = 5070) +
  tm_borders() +
tm_shape(shp = TornadoesSince2011.sf) +
    tm_lines(col = "red")

The objects named TornadoesSince2011.sf and USA_48.sf are simple feature data frames. You map variables in the data frames as layers with successive calls to the tm_shape() function.

The default projection is geographic (latitude-longitude), which is changed using the projection = argument and specifying an EPSG number (or proj4 string). Here you use 5070 corresponding to USA Contiguous Albers Equal Area Conic, USGS (EPSG = 5070 or 102003).

You make the map interactive by first turning on the "view" mode with the tmap_mode() function before running the code.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(USA_48.sf) +
  tm_borders() +
tm_shape(TornadoesSince2011.sf) +
    tm_lines(col = "red")

You can now zoom, pan, and change the background layers.

Switch back to plot mode by typing the following.

tmap_mode("plot")
## tmap mode set to plotting

Map the frequency of tornadoes by state

Suppose you want to show the number of tornadoes originating in each state on a map. You first need to prepare the data.

You do this with a series of steps connected by pipes (|>). Start with Tornadoes.sf, then remove the geometry column, then remove states outside the lower 48 using the dplyr::filter() function, then group by state, then summarize by creating a column called nT that holds the number of rows in each group (dplyr::n()), then rename the column st to state_abbr to match the column of state abbreviations in the USA_48.sf data frame. The result is assigned to the object TornadoCountsByState.df.

TornadoCountsByState.df <- Tornadoes.sf |>
  sf::st_drop_geometry() |>
  dplyr::filter(st != "PR" & st != "HI" & st != "AK") |>
  dplyr::group_by(st) |>
  dplyr::summarize(nT = dplyr::n()) |>
  dplyr::rename(state_abbr = st)

dplyr::glimpse(TornadoCountsByState.df)
## Rows: 49
## Columns: 2
## $ state_abbr <chr> "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",…
## $ nT         <int> 2143, 1809, 250, 436, 2174, 104, 2, 61, 3381, 1652, 2570, 2…

The resulting data frame contains the grouped-by column state_abbr (origin state) and the corresponding number of tornadoes. Because the counts are computed from Tornadoes.sf, they cover the full record (1950-2018): 2143 tornadoes in Alabama, 1809 in Arkansas, etc.
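The same filter, group, and count steps can be tried on a toy data frame where the answer is easy to check by eye. The values and object names below are made up for illustration.

```r
library(dplyr)

# Toy table: one row per tornado, st = state of origin (made-up values)
toy.df <- data.frame(st = c("FL", "FL", "AL", "PR", "AL", "FL"))

toy_counts.df <- toy.df |>
  filter(!st %in% c("PR", "HI", "AK")) |>  # drop states outside the lower 48
  group_by(st) |>
  summarize(nT = n()) |>                   # n() counts the rows in each group
  rename(state_abbr = st)

toy_counts.df
```

The Puerto Rico row is dropped, leaving two rows for AL and three for FL.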

Next you need to join the new data frame with the spatial data frame. You join the TornadoCountsByState.df data frame with the map simple feature data frame USA_48.sf using the dplyr::left_join() function and recycling the name.

USA_48.sf <- dplyr::left_join(USA_48.sf,
                             TornadoCountsByState.df,
                             by = "state_abbr") 

names(USA_48.sf)
##  [1] "statefp"           "statens"           "affgeoid"         
##  [4] "geoid"             "stusps"            "name"             
##  [7] "lsad"              "aland"             "awater"           
## [10] "state_name"        "state_abbr"        "jurisdiction_type"
## [13] "nT"                "geometry"

Notice that you now have a new column in the spatial data frame USA_48.sf named nT that contains the number of tornadoes in that state.
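A left join keeps every row of the first (left) table and fills in NA where the second table has no match. A minimal sketch with two made-up data frames:

```r
library(dplyr)

# Left table: every state that should appear on the map (toy values)
states.df <- data.frame(state_abbr = c("AL", "FL", "WY"))

# Right table: counts for only some of the states (toy values)
counts.df <- data.frame(state_abbr = c("AL", "FL"),
                        nT = c(2L, 3L))

joined.df <- left_join(states.df, counts.df, by = "state_abbr")
joined.df
```

Here WY is kept with nT equal to NA, so it is worth checking a joined column for missing values before mapping it.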

Next you create a draft map to see if things look correct.

tm_shape(shp = USA_48.sf, projection = 5070) +
  tm_polygons(col = "nT", 
           title = "Tornado Counts",
           palette = "Oranges")

Tornadoes are most common in the southern Great Plains into the Southeast.

You improve the defaults with additional layers including text, compass, and scale bar. The last layer is the print view.

tm_shape(shp = USA_48.sf, projection = 5070) +
  tm_polygons(col = "nT", 
              border.col = "gray70",
              title = "Tornado Counts",
              palette = "Oranges") +
  tm_text("nT", size = .5) +
  tm_compass() + 
  tm_scale_bar(lwd = .5)

The format of the {tmap} objects (meoms) is like that of the {ggplot2} geometric objects (geoms), making it easy to quickly map your data. Fine details are worked out in production.

Geometry calculations

Spatial data analysis often requires calculations on the geometry. Two of the most common are computing centroids (geographic centers) and buffers.

Geometry calculations should be done on projected coordinates. To see what CRS the simple feature data frame has use st_crs().

sf::st_crs(USA_48.sf)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

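Note that the length unit (LENGTHUNIT[]) of the ellipsoid is the meter, but the coordinates themselves are angles in degrees (ANGLEUNIT[]), so this CRS is geographic rather than projected.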

Here transform the CRS of the USA_48.sf simple feature data frame to a U.S. National Atlas equal area (EPSG: 2163) and then check it.

USA_48.sf <- USA_48.sf |>
  sf::st_transform(crs = 2163)

sf::st_crs(USA_48.sf)
## Coordinate Reference System:
##   User input: EPSG:2163 
##   wkt:
## PROJCRS["NAD27 / US National Atlas Equal Area",
##     BASEGEOGCRS["NAD27",
##         DATUM["North American Datum 1927",
##             ELLIPSOID["Clarke 1866",6378206.4,294.978698213898,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4267]],
##     CONVERSION["US National Atlas Equal Area",
##         METHOD["Lambert Azimuthal Equal Area (Spherical)",
##             ID["EPSG",1027]],
##         PARAMETER["Latitude of natural origin",45,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-100,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["False easting",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Statistical analysis."],
##         AREA["United States (USA) - onshore and offshore."],
##         BBOX[15.56,167.65,74.71,-65.69]],
##     ID["EPSG",9311]]
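A quicker check than reading the full WKT is sf::st_is_longlat(), which reports whether the coordinates are geographic. The sketch below uses a single made-up point and, as an assumption for the example, EPSG 5070 as the projected CRS.

```r
library(sf)

# A single illustrative point near Tallahassee in longitude/latitude (WGS 84)
pt.sf <- st_as_sf(data.frame(lon = -84.28, lat = 30.44),
                  coords = c("lon", "lat"),
                  crs = 4326)
st_is_longlat(pt.sf)       # TRUE: coordinates are in degrees

# After transforming to a projected CRS the coordinates are in meters
pt_proj.sf <- st_transform(pt.sf, crs = 5070)
st_is_longlat(pt_proj.sf)  # FALSE
```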

The centroid calculation locates the center of geographic objects representing the center of mass for the spatial object (think of balancing a plate on your finger).
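A shape where the answer is known in advance makes the idea concrete. In this sketch a square with corners at (0, 0) and (2, 2) should balance at the point (1, 1); the object names are made up for the example.

```r
library(sf)

# A 2 x 2 square; the ring must close on its starting vertex
square <- st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))

# The centroid is returned as a POINT geometry
center <- st_centroid(square)
st_coordinates(center)
```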

You calculate the geographic centroid of each of the lower 48 states with the st_centroid() function.

geo_centroid.sf <- sf::st_centroid(USA_48.sf)
## Warning in st_centroid.sf(USA_48.sf): st_centroid assumes attributes are
## constant over geometries of x

The result is a simple feature data frame where the geometry is a single point for each state. You keep track of the fact that this is a simple feature data frame by using an object name that ends with .sf.

The warning tells you that the attributes in the new simple feature data frame may not make sense with the new geometry.

For example, compare the first two rows of the two simple feature data frames.

head(geo_centroid.sf, n = 2)
## Simple feature collection with 2 features and 13 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -1711894 ymin: -666720.2 xmax: 791612.5 ymax: 7216.145
## Projected CRS: NAD27 / US National Atlas Equal Area
##   statefp  statens    affgeoid geoid stusps       name lsad        aland
## 1      06 01779778 0400000US06    06     CA California   00 403671196038
## 2      55 01779806 0400000US55    55     WI  Wisconsin   00 140292246684
##        awater state_name state_abbr jurisdiction_type   nT
## 1 20294133830 California         CA             state  436
## 2 29343721650  Wisconsin         WI             state 1380
##                     geometry
## 1 POINT (-1711894 -666720.2)
## 2  POINT (791612.5 7216.145)
head(USA_48.sf, n = 2)
## Simple feature collection with 2 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -2036903 ymin: -1242190 xmax: 1027143 ymax: 269562.7
## Projected CRS: NAD27 / US National Atlas Equal Area
##   statefp  statens    affgeoid geoid stusps       name lsad        aland
## 1      06 01779778 0400000US06    06     CA California   00 403671196038
## 2      55 01779806 0400000US55    55     WI  Wisconsin   00 140292246684
##        awater state_name state_abbr jurisdiction_type   nT
## 1 20294133830 California         CA             state  436
## 2 29343721650  Wisconsin         WI             state 1380
##                         geometry
## 1 MULTIPOLYGON (((-1719948 -1...
## 2 MULTIPOLYGON (((1017108 129...

The land area (aland) makes sense when the geometry is MULTIPOLYGON, but it is less meaningful when the geometry is POINT.

You map the points using the tm_dots() function after first mapping the state borders.

tm_shape(shp = USA_48.sf) +
  tm_borders(col = "gray70") +
tm_shape(shp = geo_centroid.sf) +
  tm_dots(size = 1,
          col = "black")

Buffers are polygons representing the area within a given distance of a geometric feature, regardless of whether the feature is a point, a line, or a polygon.

The function sf::st_buffer() computes the buffer and you set the distance with the dist = argument. Here you create a new simple feature data frame with only the state of Florida.

You then compute a 50 km (50,000 meters) buffer and save the resulting polygon as FL_buffer.sf.

FL.sf <- USA_48.sf |>
           dplyr::filter(state_abbr == "FL")

FL_buffer.sf <- sf::st_buffer(FL.sf, 
                              dist = 50000)
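The buffer distance is in the units of the CRS (meters here), which is one reason to work in projected coordinates. As a rough check, the buffer around a single point should be close to a circle of radius 50 km. The sketch assumes EPSG 5070 and a made-up point at the projection origin.

```r
library(sf)

# One point in a projected CRS with units of meters (illustrative location)
pt.sfc <- st_sfc(st_point(c(0, 0)), crs = 5070)

# 50 km buffer around the point
buffer.sfc <- st_buffer(pt.sfc, dist = 50000)

# Compare the buffer area with the area of a true circle
buffer_area <- as.numeric(st_area(buffer.sfc))
circle_area <- pi * 50000^2
buffer_area / circle_area
```

The ratio is slightly below 1 because the circle is approximated by a polygon with a finite number of vertices.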

Create a map containing the state border, the 50 km buffer, and the centroid. Include a compass arrow and a scale bar.

tm_shape(FL_buffer.sf) +
  tm_borders(col = "gray70") +
tm_shape(FL.sf) +
  tm_borders() +
tm_shape(geo_centroid.sf) +
  tm_dots(size = 2) +
tm_compass(position = c("left", "bottom")) + 
tm_scale_bar(text.size = 1, position = c("left", "bottom"))

The result is a map that could serve as a map of your study area (usually Figure 1 in a scientific report).

Raster maps

The package {ggmap} retrieves raster map tiles (groups of pixels) from services like Google Maps and plots them using the {ggplot2} grammar.

Map tiles are rasters stored as static image files generated by the mapping service. You do not need data files containing information on things like scale, projection, boundaries, etc. because that information is contained in the map tile.

This limits the ability to redraw or change the appearance of the map but it allows for easy overlays of data onto the map.

Get map images

You get map tiles with the ggmap::get_map() function from the {ggmap} package. You specify the bounding box (or the center and zoom). The bounding box requires the left-bottom and right-top corners of the region specified as longitude and latitude in decimal degrees.

For instance, to obtain a map of Tallahassee from the Stamen mapping service you first set the bounding box (left-bottom corner as -84.41, 30.37 and right-top corner as -84.19, 30.55) then use the ggmap::get_stamenmap() function with a zoom level of 12.

library(ggmap)

TLH_bb <- c(left = -84.41,
            bottom = 30.37,
            right = -84.19,
            top = 30.55)

TLH_map <- ggmap::get_stamenmap(bbox = TLH_bb,
                                zoom = 12)
TLH_map
## 609x641 terrain map image from Stamen Maps. 
## See ?ggmap to plot it.

The saved object (TLH_map) is a raster map specified by the class ggmap.

To view the map, use the ggmap() function.

ggmap(TLH_map)

The zoom = argument in the get_stamenmap() function controls the level of detail. The larger the number, the greater the detail.

Trial and error helps you decide on the appropriate level of detail depending on the data you need to visualize. Use boxfinder to determine the exact longitude/latitude coordinates for the bounding box you wish to obtain.

Or you can use the tmaptools::geocode_OSM() function from the {tmaptools} package. You first specify a location and then get its geocoded coordinates.

FSU.list <- tmaptools::geocode_OSM("Florida State University")
FSU.list
## $query
## [1] "Florida State University"
## 
## $coords
##         x         y 
## -84.29748  30.44236 
## 
## $bbox
##      xmin      ymin      xmax      ymax 
## -84.30650  30.43563 -84.28846  30.44907

The object FSU.list is a list containing three elements: query, coords, and bbox. You are interested in the bbox element so you save it as a vector that you assign to FSU_bb and rename the elements to left, bottom, right, and top.

FSU_bb <- FSU.list$bbox
names(FSU_bb) <- c("left", "bottom", 
                   "right", "top")
FSU_bb
##      left    bottom     right       top 
## -84.30650  30.43563 -84.28846  30.44907

You then get the map tiles corresponding to the bounding box from the stamen map service with a zoom of 16 and create the map.

FSU_map <- ggmap::get_stamenmap(bbox = FSU_bb, 
                                zoom = 16)
ggmap(FSU_map)

Add data to the map

Let’s consider a map of Chicago.

CHI_bb <- c(left = -87.936287,
            bottom = 41.679835,
            right = -87.447052,
            top = 42.000835)

CHI_map <- get_stamenmap(bbox = CHI_bb,
                         zoom = 11,
                         messaging = FALSE)
ggmap(CHI_map)

The city of Chicago has a data portal publishing a large volume of public records. Here we look at crime data from 2017. The file car_thefts.csv is a spreadsheet obtained from that portal with a list of car thefts.

You read these data using the readr::read_csv() function.

carTheft <- readr::read_csv(file = "data/car_thefts.csv")
## New names:
## * `` -> ...1
## Rows: 11416 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Case.Number, Date, Block, IUCR, Primary.Type, Description, Locatio...
## dbl (11): ...1, ID, Beat, District, Ward, Community.Area, X.Coordinate, Y.Co...
## lgl  (2): Arrest, Domestic
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(carTheft)
## # A tibble: 6 × 23
##    ...1       ID Case.Number Date           Block IUCR  Primary.Type Description
##   <dbl>    <dbl> <chr>       <chr>          <chr> <chr> <chr>        <chr>      
## 1     1 10810096 JA107583    01/07/2017 05… 004X… 0910  MOTOR VEHIC… AUTOMOBILE 
## 2     2 10810539 JA108792    01/08/2017 09… 032X… 0930  MOTOR VEHIC… THEFT/RECO…
## 3     3 10811381 JA110666    01/08/2017 07… 011X… 0910  MOTOR VEHIC… AUTOMOBILE 
## 4     4 10811599 JA109921    01/09/2017 04… 061X… 0910  MOTOR VEHIC… AUTOMOBILE 
## 5     5 10811645 JA110998    01/10/2017 07… 049X… 0910  MOTOR VEHIC… AUTOMOBILE 
## 6     6 10811674 JA111011    01/10/2017 05… 037X… 0910  MOTOR VEHIC… AUTOMOBILE 
## # … with 15 more variables: Location.Description <chr>, Arrest <lgl>,
## #   Domestic <lgl>, Beat <dbl>, District <dbl>, Ward <dbl>,
## #   Community.Area <dbl>, FBI.Code <chr>, X.Coordinate <dbl>,
## #   Y.Coordinate <dbl>, Year <dbl>, Updated.On <chr>, Latitude <dbl>,
## #   Longitude <dbl>, Location <chr>

Each row of the data frame is a single report of a vehicle theft. Location is encoded in several ways, though most importantly for us the longitude and latitude of the theft is encoded in the Longitude and Latitude columns, respectively.

You use the geom_point() function to map the location of every theft. Because ggmap() uses the map tiles (here, defined by CHI_map) as the first layer, you specify data and mapping inside of geom_point().

ggmap(CHI_map) +
  geom_point(data = carTheft,
             mapping = aes(x = Longitude,
                           y = Latitude),
             size = .25,
             alpha = .1)
## Warning: Removed 425 rows containing missing values (geom_point).

Note ggmap() replaces ggplot().
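The warning above reports rows that could not be drawn, for example rows with a missing Longitude or Latitude. A minimal sketch of how you might count such rows before plotting is given here. The data frame thefts and its values are made up for illustration and only stand in for carTheft.

```r
# Toy data frame standing in for carTheft (values are made up for illustration)
thefts <- data.frame(
  ID        = 1:5,
  Longitude = c(-87.65, NA, -87.70, -87.62, NA),
  Latitude  = c(41.88, NA, 41.79, NA, 41.95)
)

# A row cannot be plotted if either coordinate is missing
missing_xy <- is.na(thefts$Longitude) | is.na(thefts$Latitude)
n_missing <- sum(missing_xy)
n_missing

# Keep only the rows with complete coordinates
thefts_complete <- thefts[!missing_xy, ]
nrow(thefts_complete)
```

Filtering first makes the number of dropped reports explicit rather than leaving it to a plotting warning.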

More details (extra material)

Instead of relying on geom_point() and plotting the raw data, another approach is to create a heat map. This is done with a density estimator. Since the map has two dimensions and the density estimator requires a ‘kernel’ function the procedure is called a 2-D kernel density estimation (KDE).

KDE will take all the data (i.e. reported vehicle thefts) and convert it into a smoothed plot showing geographic concentrations of crime. KDE is a type of data smoothing where inferences about the population are made based on a finite data sample.
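To see what a 2-D KDE produces before it is drawn, here is a minimal sketch using MASS::kde2d() from the {MASS} package (a recommended package that ships with R and, as far as I know, the estimator that {ggplot2} calls). The point locations are simulated around a hypothetical center and are not the Chicago data.

```r
library(MASS)  # recommended package included with R

set.seed(1)
# Made-up point locations clustered around a hypothetical center
lon <- rnorm(500, mean = -87.65, sd = 0.05)
lat <- rnorm(500, mean = 41.85, sd = 0.04)

# Estimate the density on a 50 x 50 grid of locations
dens <- MASS::kde2d(x = lon, y = lat, n = 50)

# One density value per grid cell
dim(dens$z)

# Grid location with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
c(lon = dens$x[peak[1, "row"]], lat = dens$y[peak[1, "col"]])
```

The matrix dens$z is the smoothed surface. Contour lines and filled bands are two ways of drawing it.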

The core function in {ggplot2} to generate this kind of plot is geom_density_2d().

ggmap(CHI_map) +
  geom_density_2d(data = carTheft,
                  aes(x = Longitude,
                      y = Latitude))
## Warning: Removed 425 rows containing non-finite values (stat_density2d).

By default, geom_density_2d() draws a contour plot with lines of constant value. That is, each line represents approximately the same frequency of crime along that specific line. Contour plots are often used in maps (known as topographic maps) to denote elevation.

Rather than drawing lines you fill in the graph by using the fill aesthetic to draw bands of crime density. To do that, you use the related function stat_density_2d().

ggmap(CHI_map) +
  stat_density_2d(data = carTheft,
                  aes(x = Longitude,
                      y = Latitude,
                      fill = stat(level)),
                  geom = "polygon")
## Warning: Removed 425 rows containing non-finite values (stat_density2d).

Note the two new arguments:

  • geom = "polygon" - change the geometric object to be drawn from a geom_density_2d() geom to a polygon geom
  • fill = stat(level) - the value for the fill aesthetic is the level calculated within stat_density_2d(), which you access using the stat() notation.

This is an improvement, but you can adjust some settings to make the graph visually more useful. Specifically,

  • Increase the number of bins, or unique bands of color allowed on the graph
  • Make the colors semi-transparent using alpha so you can still view the underlying map
  • Change the color palette to better distinguish between high and low crime areas.

Here you use RColorBrewer::brewer.pal() from the {RColorBrewer} package to create a custom color palette using reds and yellows.

ggmap(CHI_map) +
  stat_density_2d(data = carTheft,
                  aes(x = Longitude,
                      y = Latitude,
                      fill = stat(level)),
                  alpha = .2,
                  bins = 25,
                  geom = "polygon") +
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(7, "YlOrRd"))
## Warning: Removed 425 rows containing non-finite values (stat_density2d).

The downtown region has the highest rate of vehicle theft. Not surprising given its population density during the workday. There are also clusters of vehicle thefts on the south and west sides.

Tuesday, September 26, 2022

Today

  • Inferential statistics

How to apply common statistical tests and how to understand their meaning using graphs.

This lesson marks a departure from the earlier lessons. I will continue to teach you how to code, but I will do so in the context of statistical thinking, analysis, and modeling.

I find statistics to be a natural extension to thinking about how the world works but I realize this comes with experience.

The process of drawing conclusions about a population from a sample of data is called inference, formally referred to as inferential statistics. It is a foundation of data science. There are two approaches: frequentist (standard practice) and Bayesian.

Standard practice relies on disproving a research claim that is NOT of interest.

The research claim you want to disprove is called the null hypothesis. For instance, to show that one medical treatment is better than another treatment, you first assume that the two treatments lead to equal survival rates. You then proceed to disprove this null hypothesis with data. Often the other treatment is a placebo (sugar pill).

To show that the climate is getting warmer, you first assume that it is not getting warmer. You then proceed to disprove this hypothesis with data.

Q: What is the difference between the medical treatment example and the climate change example?

One-sample test of the population mean

Oftentimes interest lies in the mean value (from a population of all values) being different than some prescribed value \(M\). So the null hypothesis (what you want to disprove) is that the population mean equals \(M\).

Using textbook notation the test is written as \[ \hbox{H}_0: \mu = M \\ \hbox{H}_A: \mu \neq M \] where H sub naught (\(\hbox{H}_0\)) is the null hypothesis stating that the unknown population mean (\(\mu\)) equals a specific value \(M\) and where H sub A (\(\hbox{H}_A\)) is the alternative hypothesis stating that the unknown population mean does not equal \(M\).

For example, given a sample of FSU students where heights are measured in centimeters, you test the hypothesis that the mean height of all students at FSU (the population) is 183 cm (6 feet).

You should always start by plotting the data together with the hypothesis. Here first create a data frame using the vector of heights and number the students from 1 to n using the sequence operator :.

ht <- c(177, 180, 179, 174, 192, 186, 165, 183)
ht.df <- data.frame(Student = 1:length(ht), 
                    Height = ht)

Then use ggplot() to make a box plot and add the hypothesized mean as a layer with the geom_hline() function and the data values as layer with the geom_point() function.

library(ggplot2)

ggplot(data = ht.df, 
       mapping = aes(x = "", y = Height)) + 
  geom_boxplot() +
  geom_point(color = "blue") +
  ylab("Height (cm)") + xlab("") +
  geom_hline(aes(yintercept = 183), color = "red") +
  scale_y_continuous(limits = c(150, 200)) +
  theme_minimal()

mean(ht)
## [1] 179.5

The median height in our sample is less than 180 cm (thick black line) and the hypothesized mean (red line) is within the interquartile range.

You write the test as: \[ \hbox{H}_0: \mu = 183 \\ \hbox{H}_A: \mu \neq 183 \]

You test the hypothesis that the mean height in the population is 183 cm with the t.test() function. The first argument is the data vector (not a data frame) and the second argument is the hypothesized mean (mu =).

t.test(ht, 
       mu = 183)
## 
##  One Sample t-test
## 
## data:  ht
## t = -1.2239, df = 7, p-value = 0.2606
## alternative hypothesis: true mean is not equal to 183
## 95 percent confidence interval:
##  172.7376 186.2624
## sample estimates:
## mean of x 
##     179.5

Where do these values come from and how do you interpret them?

The output includes the \(t\) value (-1.2239). The \(t\) value (or \(t\) statistic) is computed as \[ t = \frac{\bar x - M}{s/\sqrt{n}} \] where \(\bar x\) is the sample mean, \(M\) is the hypothesized value, \(s\) is the standard deviation and \(n\) is the sample size.

In code the \(t\) value is

(mean(ht) - 183) / (sd(ht) / sqrt(length(ht)))
## [1] -1.223853

The output also includes the degrees of freedom (7). The degrees of freedom on the \(t\) value is the sample size minus one. There are eight student heights (sample size is 8) so df = 7.

The degrees of freedom (dof, df) is a term that indicates the number of values in the calculation of a statistic that are ‘free’ to vary. Suppose you know the mean of a set of numbers (say it’s 24) and how many numbers are used to calculate the mean (say there are five). What values could the five numbers have so that the mean is 24? Four of them could be any value, but the fifth one is constrained so that the mean value equals 24.

Thus the mean is a statistic with n - 1 degrees of freedom.
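The constraint described above can be checked with a few lines of code. This is a small sketch in which the four 'free' values are arbitrary choices made for illustration.

```r
n <- 5
target_mean <- 24

# Choose any four values freely (these are arbitrary)
free_values <- c(10, 31, 22, 40)

# The fifth value is then fixed by the requirement on the mean
fifth <- n * target_mean - sum(free_values)
fifth

# Check that the five values have the required mean
mean(c(free_values, fifth))
```

Changing any of the four free values changes the fifth, but the fifth is never yours to choose.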

The output also includes the sample mean. The sample mean height of 179.5 cm is shorter than the hypothesized height of 183 cm. But, with only eight values, this amount of “shortness” does not provide us with enough evidence to reject the null hypothesis that the population height is 183 cm. So you conclude by stating that you fail to reject the null hypothesis.

The output also includes the \(p\)-value, which quantifies the evidence in support of the null hypothesis. The smaller the \(p\)-value the less support there is for the null hypothesis. The \(p\)-value is based on the area under the \(t\) distribution curve to the left of the \(t\) value (lower quantile value).

pt(q = -1.2239, df = 7) * 2
## [1] 0.2605821

The pt() function is the cumulative distribution function for the \(t\) distribution. The degrees of freedom is the parameter so you need to include that as the df = argument.

You multiply this probability by 2 because our alternative hypothesis is two-sided (not equal to \(M\)).

The output provides a 95% uncertainty interval about the sample mean. It includes the hypothesized mean height of 183 cm.

The uncertainty (confidence) interval is a statistic that tells us how much uncertainty there is in using the sample mean as an estimate for the population mean. You conclude that, based on our sample of eight students, your best estimate for the mean height of all students at FSU is 179.5 cm with a 95% uncertainty interval that ranges from 172.7 to 186.3 cm.
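The interval can be reproduced by hand, which shows where the numbers come from. The sketch below assumes the usual construction: the sample mean plus and minus a quantile of the \(t\) distribution times the standard error of the mean.

```r
ht <- c(177, 180, 179, 174, 192, 186, 165, 183)
n <- length(ht)

# Standard error of the sample mean
se <- sd(ht) / sqrt(n)

# Quantile of the t distribution leaving 2.5% in the upper tail
t_crit <- qt(p = 0.975, df = n - 1)

# Lower and upper bounds of the 95% interval
ci <- mean(ht) + c(-1, 1) * t_crit * se
ci
```

The two bounds should agree with the interval printed by t.test().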

About p-values

A \(p\)-value is an estimate of the probability that our data, or data more extreme than observed, could occur by chance if the null hypothesis is true. A small \(p\)-value tells us that our data is unusual with respect to the particular null hypothesis.

A bit more explicitly, the \(p\)-value summarizes the evidence in support of the null hypothesis. The smaller the \(p\)-value, the less evidence exists in support of the null hypothesis.

Interpretation of the \(p\)-value is stated as evidence AGAINST the null hypothesis. This is because our interest lies in the null hypothesis being untenable.

\(p\)-value          Statement of evidence against the null
less than .01      convincing
.01 - .05          moderate
.05 - .15          suggestive, but inconclusive
greater than .15   no
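If you want to apply the table consistently, one option is a small helper function. The function name evidence_against_null() is hypothetical, and assigning a boundary value (e.g., exactly .05) to the stronger category is my assumption since the table does not say. The three \(p\)-values are ones reported by tests in this lesson.

```r
# Hypothetical helper: translate p-values into the statements from the table
evidence_against_null <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.01, 0.05, 0.15, 1),
                   labels = c("convincing", "moderate",
                              "suggestive, but inconclusive", "no"),
                   include.lowest = TRUE))
}

# p-values from the height test and the two hurricane tests
evidence_against_null(c(0.2606, 0.003706, 0.03287))
```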

The \(p\)-value comes from the pt() function, which determines the area under the \(t\) distribution curve to the left of a particular value. The curve is obtained using the dt() function (density function).

For example, to plot the \(t\) distribution curve and the \(t\) value from our hypothesis above you type

curve(dt(x, 7), from = -3, to = 3, lwd = 2)
abline(v = -1.2239, col = 'red')
abline(v = 1.2239, col = 'red')

The area under the curve to the left of -1.2239 is

pt(q = -1.2239, 
   df = 7)
## [1] 0.130291

So 13% of the area lies to the left of the first red line. The distribution is symmetric so 13% of the area lies to the right of the second red line. With a two-sided test you add these two fractions to get the \(p\)-value.

pt(q = -1.2239, df = 7) + pt(q = 1.2239, df = 7, lower.tail = FALSE)
## [1] 0.2605821

Example: Strongest Atlantic hurricanes

Are hurricanes getting stronger? Let’s say you know that the strongest hurricanes in the past have an average minimum pressure of 915 mb. Lower central pressure means a stronger hurricane.

Suppose you collect data on the strength of hurricanes over the period 1980-2017.

Names <- c("Allen", "Gloria", "Gilbert", "Hugo", "Opal", "Mitch", "Isabel", "Ivan", "Katrina", "Rita", "Wilma", "Dean", "Irma", "Maria")
Year <- c(1980, 1985, 1988, 1989, 1995, 1998, 2003, 2004, 2005, 2005, 2005, 2007, 2017, 2017)
minP <- c(899, 919, 888, 918, 916, 905, 915, 910, 902, 895, 882, 905, 914, 908)
hur.df <- data.frame(Year, Names, minP, Basin = "A")

You are interested in whether these recent Atlantic hurricanes since 1980 have an average minimum pressure less than 915 mb. So this is our alternative hypothesis.

Your null hypothesis is that the average minimum pressure (\(\mu\)) is 915 mb or higher and the alternative hypothesis is that it is less than 915.

Formally, you write the statistical test as \[ \hbox{H}_0: \mu \ge 915 \\ \hbox{H}_A: \mu \lt 915 \]

Start with a plot.

ggplot(hur.df, 
       mapping = aes(x = "", y = minP)) + 
  geom_boxplot() +
  geom_point(color = "blue") +
  ylab("Minimum Pressure (mb)") + xlab("") +
  geom_hline(aes(yintercept = 915), color = "red") +
  scale_y_continuous() +
  theme_minimal()

mean(hur.df$minP)
## [1] 905.4286

You see that the data support our idea (hypothesis) that recent hurricanes have, on average, pressures below 915 mb.

You formally test the hypothesis with the t.test() function. The first argument is the data values as a vector (here hur.df$minP), the second argument is the hypothesized mean, and the alternative = argument is set to "less" because that is our alternative hypothesis.

t.test(hur.df$minP, 
       mu = 915,
       alternative = "less")
## 
##  One Sample t-test
## 
## data:  hur.df$minP
## t = -3.1679, df = 13, p-value = 0.003706
## alternative hypothesis: true mean is less than 915
## 95 percent confidence interval:
##      -Inf 910.7792
## sample estimates:
## mean of x 
##  905.4286

Here you summarize/conclude as follows: The sample mean intensity of the recent hurricanes is 905.4 mb, which is less than 915 mb by a difference of about 10 mb.

Given this amount of difference (the effect size) together with a sample size of 14, you conclude there is convincing evidence that, on average, the strongest hurricanes since 1980 are stronger than those in the past.
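As with the height example, you can reproduce the \(t\) value and the \(p\)-value by hand. A sketch: because the alternative is "less", the \(p\)-value is the area to the left of the \(t\) value only and is not doubled.

```r
minP <- c(899, 919, 888, 918, 916, 905, 915, 910, 902, 895, 882, 905, 914, 908)
n <- length(minP)

# t value: difference from the hypothesized mean in units of standard error
t_value <- (mean(minP) - 915) / (sd(minP) / sqrt(n))

# One-sided p-value: area under the t curve to the left of the t value
p_less <- pt(q = t_value, df = n - 1)

c(t = t_value, p = p_less)
```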

Graphical inference

Applying a \(t\) test is an example of statistical inference. You draw conclusions about the population from the sample of data. This is why statistics is useful: you don’t want our conclusions to apply only to a sample. You want them to apply to the population at large.

There are two parts: testing (is there a difference?) and estimation (how big is the difference?). Testing can be done visually by asking, “Is what you see really there?” More precisely, is what you see in a plot of the sample an accurate reflection of the entire population?

For example, generate samples with the hypothesized mean value. Then see how these samples compare with our data.

rnorm(n = 8, 
      mean = 183, 
      sd = sd(ht))
## [1] 184.8161 180.0985 179.6905 181.2890 181.0646 186.8697 188.2835 173.1945

Consider the situation where you try to ‘find’ your data from a lineup of plots generated from ‘fake’ data under the null hypothesis. By ‘find’ I mean determine which plot corresponds to your data.

The functions in the {nullabor} package generate data sets and plots under various null hypotheses through permutation or simulation.

The nullabor::null_dist() function is used to create another function with arguments var = that specifies the name of the column in our data frame containing the data values of interest, dist = that specifies what type of distribution you assume for your data (here ‘normal’) and params = as a list of the mean and standard deviation of your data.

The function null_dist() generates another function based on your null hypothesis. In this case a normal distribution for the variable Height (in the data frame ht.df) centered on 183 with a standard deviation equal to the standard deviation of the sample (ht).

fun <- nullabor::null_dist(var = "Height", 
                           dist = 'normal', 
                           params = list(mean = 183, sd = sd(ht)))

The magic happens when you use the lineup() function that takes as its arguments the saved function (here fun) and the true data frame (here ht.df) and returns a data frame in long format (here assigned to the object dfL).

dfL <- nullabor::lineup(method = fun, 
                        true = ht.df)
## decrypt("h8RX 5IvI ne TAynvnAe k2")
head(dfL)
##   Student   Height .sample
## 1       1 185.2864       1
## 2       2 180.2940       1
## 3       3 180.3991       1
## 4       4 174.8173       1
## 5       5 194.5801       1
## 6       6 193.2948       1
tail(dfL)
##     Student   Height .sample
## 147       3 182.5380      20
## 148       4 190.7539      20
## 149       5 176.0347      20
## 150       6 190.4374      20
## 151       7 193.5517      20
## 152       8 185.0034      20

The data frame contains the random null heights and the observed heights. The heights are listed in the column name Height and by default there are 20 samples each indicated by a number in the column labeled .sample.

The output contains an encryption key that hides the sample number corresponding to the observed heights.

You know how to plot side-by-side using the facet_wrap() function so you can try to visually pick out the observed heights.

ggplot(data = dfL, 
       mapping = aes(x = "", y = Height)) + 
  geom_boxplot() + 
  facet_wrap(~ .sample, nrow = 1, ncol = 20) +
  theme_minimal()

Can you pick out the actual data? A plot of the real data is hidden among the 19 ‘fake’ data. The fakes are plots of data generated from the null hypothesis. If you can spot the real data, then there is evidence that your data is different from the null hypothesis.

With a null hypothesis stating that the mean height is 183 cm, the evidence is weak that the data is different. So you fail to reject the null.

Suppose the null hypothesis states that the mean height is 190 cm. Retest with the new null hypothesis.

t.test(ht, 
       mu = 190)
## 
##  One Sample t-test
## 
## data:  ht
## t = -3.6716, df = 7, p-value = 0.007948
## alternative hypothesis: true mean is not equal to 190
## 95 percent confidence interval:
##  172.7376 186.2624
## sample estimates:
## mean of x 
##     179.5

In this case, the sample of students looks unusual (they are too short) if the true height is 190 cm. The \(p\)-value is reduced to .008.

The uncertainty interval does not change. The uncertainty interval is about the sample as an estimate of the unknown population mean regardless of what you think the mean is.

The value of 190 lies outside the uncertainty interval consistent with a small \(p\) value.
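The link between the interval and the \(p\)-value can be checked directly. In this sketch the helper check_mu() is hypothetical. It reports whether a hypothesized mean falls inside the 95% interval and the corresponding two-sided \(p\)-value.

```r
ht <- c(177, 180, 179, 174, 192, 186, 165, 183)

# Hypothetical helper: is mu0 inside the interval, and what is the p-value?
check_mu <- function(mu0, x, level = 0.95) {
  tt <- t.test(x, mu = mu0, conf.level = level)
  data.frame(mu0 = mu0,
             inside_interval = mu0 >= tt$conf.int[1] & mu0 <= tt$conf.int[2],
             p_value = tt$p.value)
}

res <- rbind(check_mu(183, ht), check_mu(190, ht))
res
```

A hypothesized mean inside the 95% interval goes with a two-sided \(p\)-value above .05, and one outside the interval goes with a \(p\)-value below .05.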

Repeat the graphical lineup but this time with a hypothesized mean of 190 cm.

fun <- nullabor::null_dist(var = "Height", 
                           dist = 'normal', 
                           params = list(mean = 190, sd = sd(ht)))
inf <- nullabor::lineup(fun, ht.df)
## decrypt("h8RX 5IvI ne TAynvnAe YQ")
ggplot(inf, aes(x = "", y = Height)) + 
  geom_boxplot() + 
  facet_wrap(~ .sample, nrow = 1, ncol = 20) +
  theme_minimal()

In this case it is easier to pick out the actual data. And this ability to pick out the actual data corresponds with a lower \(p\)-value.

The idea of comparing your actual data to a set of data generated under a null hypothesis is a fundamental idea of statistics.

Said another way: you have data and want to use it to say something about the world. You do this by comparing your data with data under other possible realities (hypotheticals–“what if?” or counterfactuals).

The possible realities are generated with computer code using statistical distributions.

Graphical inference can be used in a variety of applications. Here is an example from my tornado research. Here the innocent is a spatially random distribution of genesis. Can we pick out the actual data?

More information on the {nullabor} package here

Two-sample test of the difference in population means

With two data samples the null hypothesis is that the two samples have the same population mean. You assume both samples can be modeled with a normal distribution.

You test the null hypothesis that the two samples have the same population mean by computing the \(t\) value. In this case, the \(t\) value is the difference in sample means divided by the standard error of the difference in means (SEDM).

There are two ways to calculate SEDM.

    1. Assume equal variance: use the pooled standard deviation (\(s\)). Under the null hypothesis, the \(t\) value will follow a \(t\) distribution with n1 + n2 - 2 degrees of freedom (df).
    2. Don’t assume equal variances (this is the default assumption). Under the null hypothesis, the \(t\) statistic approximates a \(t\) distribution. In this case it is called the Welch procedure and the degrees of freedom is not an integer.

Usually the methods give similar results (unless group sizes and variances are widely different between the two samples).
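The two ways of calculating SEDM can be sketched in code. The vectors x and y below are made-up values used only to illustrate the arithmetic. The Welch degrees of freedom use the standard Welch-Satterthwaite formula.

```r
# Two small made-up samples (hypothetical values, for illustration only)
x <- c(12.1, 14.3, 11.8, 15.2, 13.5, 12.9)
y <- c(15.0, 17.8, 13.2, 19.1, 16.4, 18.3, 14.7, 20.2)
n1 <- length(x); n2 <- length(y)
v1 <- var(x); v2 <- var(y)

# 1. Equal variances: pooled standard deviation
s_pooled <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
sedm_pooled <- s_pooled * sqrt(1 / n1 + 1 / n2)
t_pooled <- (mean(x) - mean(y)) / sedm_pooled
df_pooled <- n1 + n2 - 2

# 2. Unequal variances (Welch): each sample keeps its own variance
sedm_welch <- sqrt(v1 / n1 + v2 / n2)
t_welch <- (mean(x) - mean(y)) / sedm_welch
df_welch <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

c(t_pooled = t_pooled, df_pooled = df_pooled,
  t_welch = t_welch, df_welch = df_welch)
```

These values should match t.test(x, y, var.equal = TRUE) and t.test(x, y), respectively.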

Example: hurricanes in the North Atlantic and the eastern North Pacific

Are hurricanes that occur over the eastern North Pacific weaker (or stronger) than those that occur over the Atlantic? Let’s look at the evidence.

You add the data from the Pacific to the data frame hur.df. You do this by making a similar data frame then using the rbind() function to combine them.

Names <- c("Trudy", "Gilma", "Olivia", "Guillermo", "Linda", "Juliette", "Elida", "Hernan", "Kenna", "Ioke", "Rick", "Celia", "Marie", "Odile", "Patricia", "Lane", "Walaka")
Year <- c(1990, 1994, 1994, 1997, 1997, 2001, 2002, 2002, 2002, 2006, 2009, 2010, 2014, 2014, 2015, 2018, 2018)
minP <- c(924, 920, 923, 919, 902, 923, 921, 921, 913, 915, 906, 921, 918, 918, 872, 922, 920)

df <- data.frame(Year, Names, minP, Basin = "P")

hur.df <- rbind(hur.df, df)

You start with a lineup of plots where you shuffle (permute) the minimum pressures between the basins multiple times. This is done with the nullabor::null_permute() function identifying what column to permute.

fun <- nullabor::null_permute("Basin")
inf <- nullabor::lineup(fun, hur.df, n = 12)
## decrypt("h8RX 5IvI ne TAynvnAe kk")
ggplot(inf, aes(x = Basin, y = minP, color = Basin)) + 
 geom_boxplot() + 
 facet_wrap(~ .sample)

Based on these plots, what do you anticipate you will conclude when you formalize this with a \(t\) test? This is important.

Let \(\mu_{A}\) be the population mean of Atlantic hurricanes and \(\mu_{P}\) be the population mean of Pacific hurricanes, you formally write the statistical test as \[ \hbox{H}_0: \mu_{A} = \mu_{P} \\ \hbox{H}_A: \mu_{A} \neq \mu_{P} \] You implement the test as follows

t.test(minP ~ Basin, 
       data = hur.df, 
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  minP by Basin
## t = -2.2407, df = 29, p-value = 0.03287
## alternative hypothesis: true difference in means between group A and group P is not equal to 0
## 95 percent confidence interval:
##  -18.6455942  -0.8502041
## sample estimates:
## mean in group A mean in group P 
##        905.4286        915.1765

You write a summary and conclusion as follows: “The sample mean intensity of Atlantic hurricanes is 905.4 mb and the sample mean intensity of the Pacific hurricanes is 915.2 mb. Given this amount of difference (the effect size) together with a sample size of 31 hurricanes (14 Atlantic and 17 Pacific), you conclude there is moderate evidence that the mean hurricane intensity in the Atlantic is different than the mean hurricane intensity of the Pacific.”
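The permutation idea behind the lineup can also be turned into a number. The sketch below is an informal check and not a replacement for the \(t\) test: it shuffles the basin labels many times and counts how often the shuffled difference in means is at least as large in magnitude as the observed difference. The helper diff_means(), the seed, and the number of shuffles are my choices.

```r
atl <- c(899, 919, 888, 918, 916, 905, 915, 910, 902, 895, 882, 905, 914, 908)
pac <- c(924, 920, 923, 919, 902, 923, 921, 921, 913, 915, 906, 921, 918, 918,
         872, 922, 920)
minP <- c(atl, pac)
basin <- rep(c("A", "P"), times = c(length(atl), length(pac)))

# Hypothetical helper: difference in mean pressure, Atlantic minus Pacific
diff_means <- function(values, groups) {
  mean(values[groups == "A"]) - mean(values[groups == "P"])
}
obs_diff <- diff_means(minP, basin)

set.seed(2022)
# Differences in means after shuffling the basin labels
perm_diff <- replicate(5000, diff_means(minP, sample(basin)))

# Fraction of shuffles with a difference at least as large as observed
p_perm <- mean(abs(perm_diff) >= abs(obs_diff))
c(observed = obs_diff, p = p_perm)
```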

Example: Palmer penguins

Do Adelie penguins have shorter flippers than Chinstrap penguins? The data frame called penguins is available as part of the {palmerpenguins} package.

library(palmerpenguins)
## 
## Attaching package: 'palmerpenguins'
## The following object is masked _by_ '.GlobalEnv':
## 
##     penguins
head(penguins)
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <dbl>

Remove the rows corresponding to the larger Gentoo penguins.

penguins <- penguins |>
  dplyr::filter(species != "Gentoo")
ggplot(data = penguins, 
       mapping = aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(mapping = aes(color = species), 
               width = .3, show.legend = FALSE) +
  geom_jitter(mapping = aes(color = species), 
              alpha = .5, show.legend = FALSE, 
              position = position_jitter(width = 0.2, seed = 0)) +
  scale_color_manual(values = c("darkorange","purple")) +
  labs(x = "Species", y = "Flipper length (mm)") +
  theme_minimal()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).

We see that, on average, Adelie penguins have shorter flippers than Chinstrap penguins, but there is substantial variability from one individual to another.

Again you start with a lineup of plots where you permute flipper length between the species.

fun <- nullabor::null_permute("species")
inf <- nullabor::lineup(fun, penguins, n = 12)
## decrypt("h8RX 5IvI ne TAynvnAe ku")
ggplot(inf, 
       mapping = aes(x = species, color = species, y = flipper_length_mm)) + 
  geom_boxplot() + 
  scale_color_manual(values = c("darkorange","purple")) +
  facet_wrap(~ .sample)
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).

What do you anticipate you will conclude when you formalize this with a \(t\) test?

t.test(flipper_length_mm ~ species, 
       data = penguins,
       var.equal = TRUE,
       alternative = "less")
## 
##  Two Sample t-test
## 
## data:  flipper_length_mm by species
## t = -5.974, df = 217, p-value = 4.689e-09
## alternative hypothesis: true difference in means between group Adelie and group Chinstrap is less than 0
## 95 percent confidence interval:
##       -Inf -4.246781
## sample estimates:
##    mean in group Adelie mean in group Chinstrap 
##                189.9536                195.8235

What do you write in your summary and conclusions?

Adelie penguins in the sample have a mean flipper length of 190 mm, which is shorter than the mean flipper length of the Chinstrap penguins by about 6 mm. Given a sample size of 219 penguins, this difference provides convincing evidence against the null hypothesis that the population mean flipper length of Adelie penguins is the same as (or longer than) that of Chinstrap penguins.

Your turn

Perform a test of the hypothesis that females have shorter bill lengths.

t.test(___ ~ ___, 
       data = ___,
       var.equal = TRUE,
       alternative = ___)

Test of equal variance

In the test of the population means above we assume that the variability among penguins is the same regardless of species (var.equal = TRUE).

You check this assumption by computing the variance by species.

penguins |>
  dplyr::group_by(species) |>
  dplyr::summarize(varFL = var(flipper_length_mm, na.rm = TRUE))
## # A tibble: 2 × 2
##   species   varFL
##   <chr>     <dbl>
## 1 Adelie     42.8
## 2 Chinstrap  50.9

There is less variance in flipper length for the sample of Adelie penguins compared with the variance in flipper length for the sample of Chinstrap penguins. But is this difference significant?

The ratio of the two variances is about .84.

You formally test with the var.test() function under the null hypothesis that the ratio of the two variances is equal to 1.

var.test(flipper_length_mm ~ species, 
         data = penguins)
## 
##  F test to compare two variances
## 
## data:  flipper_length_mm by species
## F = 0.84076, num df = 150, denom df = 67, p-value = 0.3854
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5489336 1.2465576
## sample estimates:
## ratio of variances 
##          0.8407631

The output shows that the ratio of the variances is .8407. Under the null hypothesis that the true ratio is 1, the ratio follows an F distribution with 150 and 67 degrees of freedom, which gives a two-sided \(p\)-value of .3854. The use of the F distribution is why the test is sometimes called the ‘F-test.’
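
The \(p\)-value and the interval can be reproduced directly from the F distribution. This is a minimal sketch that starts from the values printed by var.test(); the object names are chosen here for illustration.

```r
# Values copied from the var.test() output
f_stat <- 0.8407631
df_num <- 150
df_den <- 67

# Two-sided p-value: twice the smaller of the two tail probabilities
lower_tail <- pf(f_stat, df1 = df_num, df2 = df_den)
p_value <- 2 * min(lower_tail, 1 - lower_tail)

# 95% interval for the true ratio of variances
ci <- f_stat / qf(c(0.975, 0.025), df1 = df_num, df2 = df_den)

p_value
ci
```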

Thus you conclude there is no statistical evidence of a difference in the variability in flipper length between the two species.

The uncertainty interval includes the value of 1 and is quite wide.

It should be kept in mind that the test of common variance is sensitive to small departures from a normal distribution and it is based on the assumption that the groups are independent. It should not be applied in the setting where the data are paired.

The t.test() and var.test() functions are in the {stats} package as part of the base install of R.

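The {ctest} package, which once collected the “classical tests,” has been merged into {stats}. The {stats} package has several alternative tests for variance homogeneity, including bartlett.test() and fligner.test(), each with its own assumptions, benefits, and drawbacks.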
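
Two alternative tests of variance homogeneity that ship with {stats} are bartlett.test() and fligner.test(). The sketch below applies both to simulated data. The data frame sim and its columns are hypothetical and serve only to show the calls.

```r
set.seed(1)  # hypothetical data, for illustration only

# Two simulated groups drawn with the same standard deviation
sim <- data.frame(
  value = c(rnorm(40, mean = 190, sd = 6.5),
            rnorm(30, mean = 196, sd = 6.5)),
  group = factor(rep(c("A", "B"), times = c(40, 30)))
)

# Bartlett's test: assumes normal data, works with two or more groups
bt <- bartlett.test(value ~ group, data = sim)

# Fligner-Killeen test: rank based, less sensitive to non-normality
ft <- fligner.test(value ~ group, data = sim)

bt$p.value
ft$p.value
```

Both functions accept the same formula interface as var.test(), so they can be swapped in with the penguins data in the same way.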

Wilcoxon (Mann-Whitney U) non-parametric test of difference in means

You can avoid the distributional assumption (that the data have a normal distribution) by using a non-parametric test. The non-parametric alternative is the Wilcoxon test, also known as the Mann-Whitney U test.

The test statistic ‘W’ is the sum of the ranks of the first group (with ranks assigned in the combined sample) minus the smallest value that sum can take, \(n_1(n_1 + 1)/2\), where \(n_1\) is the size of the first group. It is obtained with the wilcox.test() function.
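
As computed by wilcox.test() in R, W equals the rank sum of the first sample minus \(n_1(n_1 + 1)/2\). The sketch below verifies this by hand on two small made-up samples (x and y are hypothetical and have no ties).

```r
# Hypothetical samples with no tied values
x <- c(1.1, 2.3, 3.8, 5.0)
y <- c(0.7, 2.9, 4.4, 6.2, 7.5)

# Rank all values together, then sum the ranks that belong to x
r <- rank(c(x, y))
n1 <- length(x)
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Statistic reported by wilcox.test()
W_r <- unname(wilcox.test(x, y)$statistic)

W_manual  # 7
W_r       # 7
```

Here the ranks of x in the combined sample are 2, 3, 5, and 7, which sum to 17, and 17 minus 10 gives W = 7.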

For example, is there evidence of more or fewer U.S. hurricanes recently? One way to examine this question is to divide the time period into two samples and compare the means from both samples.

loc <- "http://myweb.fsu.edu/jelsner/temp/data/US.txt"
LH.df <- read.table(loc, 
                    header = TRUE)

You consider the first half of the record as separate from the second half and ask whether there is a difference in hurricane counts between the two halves. The null hypothesis is that the population means are the same.

First create a vector that divides the record length into two equal parts.

early <- LH.df$Year <= median(LH.df$Year)
head(early); tail(early)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
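
A quick way to confirm that the split yields two halves of equal size is to tabulate the logical vector. This is a minimal sketch on a stand-in year vector; the range 1851 to 2016 is an assumption made for the example and is not read from the data file.

```r
# Hypothetical year vector; in the lesson the years come from LH.df$Year
Year <- 1851:2016

# TRUE for years at or before the median year
early <- Year <= median(Year)

# Number of years in each half
table(early)
```

With the actual data, table(early) on the vector created above does the same check.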

Then use the t test on the U.S. hurricane counts where the explanatory variable is the vector early.

t.test(LH.df$All ~ early)
## 
##  Welch Two Sample t-test
## 
## data:  LH.df$All by early
## t = -0.59291, df = 162.53, p-value = 0.5541
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -0.5739157  0.3088554
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##             1.60241             1.73494

The \(p\)-value is large (> .15) so you fail to reject the null hypothesis of no difference in mean number of hurricanes between the earlier and the later periods.

The 95% uncertainty interval is centered on the difference in means. Since the interval contains zero, there is no evidence against the null hypothesis.

Since there are 166 years in the record (length(LH.df$All)) you take the first 83 years for the first sample (s1) and the next 83 years for the second sample (s2) and then perform the test. Try it now.

s1 <- LH.df$All[early]
s2 <- LH.df$All[!early]

Small counts are not described very well by a normal distribution.

ggplot(data = LH.df, 
       mapping = aes(factor(All))) + 
  geom_bar() + 
  ylab("Number of Years") + 
  xlab("Number of Hurricanes")

So you use the non-parametric Wilcoxon test instead.

wilcox.test(s1, s2)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  s1 and s2
## W = 3592.5, p-value = 0.6239
## alternative hypothesis: true location shift is not equal to 0

The \(p\)-value again exceeds .15 so your conclusion is the same. The average number of hurricanes during the second half of the record is statistically indistinguishable from the average number of hurricanes during the first half of the record.